SUMMARY SYNTHESIS OF TQRIS VALIDATION STUDIES


SUMMARY 


The Race to the Top—Early Learning Challenge (RTT-ELC) grants program, sponsored by 
the U.S. Departments of Education and Health and Human Services, aimed to improve children’s 
access to high quality early care and education. RTT-ELC awarded more than $1 billion over 
three rounds of grants to help states develop and implement systems that rate early learning and 
development programs on quality and help them improve. These systems are known as tiered 
quality rating and improvement systems (TQRIS). 


To strengthen the quality of early learning and development programs, TQRIS rate programs 
on quality standards and publicize the ratings of individual programs. States can use these ratings 
to identify low quality programs that need to improve, and parents can use the ratings to choose 
high quality programs for their children. However, the usefulness of the ratings for these 
purposes depends on how accurately they measure programs’ quality, that is, their validity. A 
key objective of RTT-ELC was for states to study the validity of their ratings. 


To inform states’ continued development of TQRIS and future validation studies, this report 
synthesizes findings from validation studies conducted by nine states that received RTT-ELC 
grants. It also describes the challenges that researchers faced when conducting these studies. 
Based on studies from the nine states and interviews with the researchers who conducted them, 
the following key findings emerged: 


• All nine states used external measures of quality to examine the validity of their ratings; 
eight used an independently collected measure of program quality and eight used at least one 
measure of children’s outcomes. 


• The ratings distinguished between programs with differing quality; higher-rated programs 
had higher scores on independent measures of quality. However, the overall level of quality 
for higher-rated programs could not be described as high based on these independent 
measures. 


• The ratings were not related to differences in children’s outcomes; children who attended 
higher-rated programs did not have better developmental outcomes than those attending 
lower-rated ones. 


• Researchers from all nine states reported that their non-experimental designs limited the 
interpretation of findings. In addition, most researchers perceived recruiting child care 
providers for the studies and attaining sufficient representation across the rating levels as the 
most challenging aspects of the studies. 


As TQRIS are refined further, include more programs, and become more fully implemented, 
it will be necessary to conduct additional studies that examine the validity of ratings. 
I. INTRODUCTION 


High quality early care and education yields 
significant benefits—especially for children from 
low-income and disadvantaged households (Dearing 
et al. 2009). Children who attend a high quality 
preschool for as little as a year can experience 
improvements in their language, literacy, and 
mathematics skills (Yoshikawa et al. 2013). To help 
increase access to high quality programs for children, 
particularly for children with high needs, Race to the 
Top—Early Learning Challenge (RTT-ELC) 
promoted progress on five objectives related to tiered 
quality rating and improvement systems (TQRIS) 
(Box 1). The U.S. Department of Education (ED) and 
U.S. Department of Health and Human Services 
(HHS) awarded the RTT-ELC grants to states through 
three rounds of competition. States received Round 1, 
2, and 3 grants in 2012, 2013, and 2014, respectively. 


The purpose of the grants was to strengthen the quality of early learning and development 
programs by supporting states as they develop and implement TQRIS. States created TQRIS to 
establish standards to define quality, rate programs based on those standards, and publicize the 
ratings of individual programs. The quality ratings are intended to help families select better 
programs for their children and to help states identify and support the improvement of low 
quality programs. The usefulness of the ratings for these purposes depends on whether they are 
valid measures of programs’ quality. Assessing the validity of these ratings helps both to ensure 
the integrity of the system and inform improvements to the system (Kirby et al. 2015). 


There are several potential approaches to validating a TQRIS, including examining the 
evidence underlying individual standards, assessing the reliability and accuracy of the 
information used to construct the ratings, and testing whether differences in ratings correspond 
with differences on other measures of quality and children’s outcomes (Zellman and Fiene 
2012). ED and HHS evaluated states that applied for RTT-ELC grants on the last approach — 
their plans to test whether differences in ratings correspond with differences on measures of 
program quality and children’s outcomes. 


Most validation studies of TQRIS conducted before RTT-ELC have found significant 
relationships between ratings and program quality.1 However, validation studies have generally 
found weak evidence of a relationship between ratings and children’s outcomes. Only two of five 
studies that examined individual states’ TQRIS found a positive relationship between the ratings 
and children’s outcomes.2 Another study used existing data from multiple states to simulate 
programs’ ratings under different states’ TQRIS (Sabol et al. 2013). It found these simulated 
ratings had little association with students’ math, prereading, language, and social skills. 


1 These studies include Malone et al. (2011), Elicker et al. (2011), Lahti et al. (2011), Bryant et al. (2001), and 
Norris and Dunn (2004). Studies of Colorado’s TQRIS (Zellman et al. 2008) and Minnesota’s TQRIS (Tout et al. 
2011) did not consistently find a relationship between the ratings and external measures of program quality. 


To inform ongoing TQRIS development, the Institute of Education Sciences (IES) at ED 
initiated a study to learn about TQRIS in states that received RTT-ELC grants. The study 
focused on center-based early learning and development programs that served preschool-age 
children (these programs may have also served infants, toddlers, and school-age children). The 
study generated a series of reports and briefs. The first report examined progress on the first 
three TQRIS objectives (outlined in Box 1) by describing the development, structure, and 
characteristics of TQRIS in states that received Round 1 grants. The report found that states 
varied substantially in the ways they promoted participation in TQRIS, defined quality standards, 
verified that programs met the standards, and calculated ratings (Kirby et al. 2017). Future work 
from this study plans to examine states’ progress on the fourth TQRIS objective, examining the 
(1) number and percentages of programs at top levels of the TQRIS, and (2) patterns of TQRIS 
ratings across states with different TQRIS characteristics and policies. 


This report contributes to this larger study by focusing on the fifth RTT-ELC objective— 
validating the effectiveness of the TQRIS. It synthesizes findings across validation studies 
conducted by nine RTT-ELC states: California, Delaware, Massachusetts, Minnesota, Ohio, 
Oregon, Rhode Island, Washington, and Wisconsin (Box 2 includes a list of studies that we 
reviewed). Seven of these states (all except Oregon and Wisconsin) received Round 1 RTT- 
ELC grants. Oregon and Wisconsin received Round 2 grants. We selected these nine states for 
the synthesis report because they were the first nine RTT-ELC states to complete a validation 
report and either publicly release the report or provide it to include in the synthesis. 


This report also adds to the growing knowledge base about the validity of ratings from 
TQRIS. This knowledge base includes a recent report that synthesized validation studies 
conducted by ten states: Arizona, California, Delaware, Maryland, Massachusetts, Minnesota, 
Oregon, Rhode Island, Washington, and Wisconsin (Tout et al. 2017).3 Nine of these states 
received RTT-ELC grants. In that report, the authors of the states’ validation studies described 
findings from individual states and the patterns of findings across states, but they did not 
combine data or calculate averages across states. Like earlier validation studies, Tout et al. 
(2017) found evidence of an association between TQRIS ratings and measures of program 
quality. However, they concluded that evidence of a relationship between the ratings and 
children’s outcomes was inconsistent. 


This independent synthesis report builds on Tout et al. (2017) in several ways. First, this 
report combines data from several states to provide information about the magnitude and 
statistical significance of the average associations between (1) TQRIS ratings and measures of 
program quality, and (2) TQRIS ratings and children’s developmental outcomes across these 
states. Second, this report only analyzes associations between TQRIS ratings and children’s 
outcomes that were based on children who had similar scores at the beginning of each validation 
study; this helps to maximize the likelihood that associations between TQRIS ratings and 
children’s outcomes reflect differences in the quality of the learning environment across 
programs that receive different ratings (as opposed to differences between the types of children 
who attend programs with different ratings). Third, the two reports synthesize slightly different 
sets of states, although they have eight states in common. This report includes Ohio; Tout et al. 
(2017) include Arizona and Maryland.4 Finally, to inform researchers planning future validation 
studies about the challenges they may face, this report describes the challenges that 
validation study authors reported, based on interview data we systematically collected from the 
authors. 


2 These studies include Zellman et al. (2008), Thornburg et al. (2009), Elicker et al. (2011), Tout et al. (2011), and 
Sabol and Pianta (2012). A study of Missouri’s TQRIS found a positive relationship for social skills and behavior, 
but not vocabulary, early literacy, or math skills (Thornburg et al. 2009). A study of Virginia’s TQRIS found a 
positive relationship for literacy skills (Sabol and Pianta 2012). 


3 Nine of these states examined relationships between TQRIS ratings and program quality, and seven examined 
relationships between ratings and children’s outcomes. 


Research questions 

To inform states’ continued development of TQRIS and future validation studies, we 
examined the following research questions for the nine RTT-ELC states: 
• How did states validate their TQRIS? 


• Do ratings of TQRIS reflect differences in the quality of programs (based on quality 
measures that researchers collected independently, outside of the TQRIS)? 


• Do children who attend programs with higher ratings have better developmental outcomes 
than those who attend programs with lower ratings? Is there a relationship between 
programs’ ratings and outcomes for all children, or specifically for low-income children? 


• What were the most common challenges associated with conducting the states’ validation 
studies? 


To answer these questions, we reviewed state validation reports and interviewed the 
researchers who conducted the validation studies. 


4 This report examines a total of nine states, and Tout et al. (2017) examine a total of ten states. However, some 
states included in the two syntheses did not analyze both measures of program quality and children’s outcomes. 
Therefore, this report analyzes the associations between TQRIS ratings and measures of program quality for eight 
states (one fewer than Tout et al. 2017) and the associations between ratings and children’s outcomes for eight states 
(one more than Tout et al. 2017). 
II. BACKGROUND ON STATES’ TQRIS 


TQRIS began in two states (Oklahoma and Colorado) in the late 1990s but expanded to 39 
states by the end of 2016 (Build Initiative and Child Trends 2016). This report focuses on nine 
states that first implemented TQRIS from 2004 to 2013 (Table II.1). Most of these states had 
TQRIS that were still in flux when researchers started their validation studies; at the start of the 
studies, six states (California, Delaware, Minnesota, Rhode Island, Oregon, and Washington) had 
either not fully implemented their TQRIS or were in the process of changing them substantially. 


TQRIS rate early learning and development programs against state-defined quality standards 
that include components such as licensing compliance, quality of the learning environment, and 
qualifications of the workforce. States define standards differently, based on their priorities, and 
can use different standards and measures to rate different types of programs. For example, states 
use different measures of the quality of the learning environment for programs operated out of 
the caregiver’s home—that is, family child care programs—versus those that operate in an 
institutional setting such as a center—that is, center-based programs. Family child care programs 
are a specific type of home-based care. Home-based care can also include an individual or shared 
sitter or a relative. 


States also set different standards for caring for children of different ages (such as infants, 
toddlers, and preschool-age children). For example, standards for caring for infants and toddlers 
require smaller group sizes than those for preschool-age children. States also may have 
policies—alternative pathways or automatic ratings—that allow eligible programs (such as Head 
Start programs) to be exempted from part or all of the TQRIS rating process because they have 
already demonstrated meeting a comparable set of standards (such as federal standards for 
receiving Head Start funds). 


Based on meeting the state-defined standards, programs receive an overall rating level. The 
number of overall rating levels differs across states; in the nine RTT-ELC states in this report, 
rating levels ranged from 1 to 4 in two states and 1 to 5 in seven states. However, rating levels 
are not directly comparable across states, even among states with the same number of rating 
levels, because states define quality standards and calculate rating levels differently (Kirby et al. 
2015). The commonality across states is that each rating level signifies higher quality than the 
rating level below it; for example, programs rated level 3 are expected to be higher quality than 
programs rated level 2. 


Differences in TQRIS across the nine states may affect how states conducted their validation 
studies, how much their ratings distinguish between high and low quality programs, and whether 
there is a relationship between ratings and children’s outcomes. The differences across the nine 
states include the following: 


Types of programs that participate. The researchers who conducted states’ validation 
studies had to select the types of programs to study; they could study only programs that states 
allowed to participate in their TQRIS. In two of the nine states, participation in TQRIS is 
voluntary for all programs. Other states require certain types of programs to participate (for 
example, programs that receive public funds to serve low-income children) but may allow other 
programs (such as family child care programs) to participate voluntarily. 
The percentage of programs that participate in TQRIS varies across states and program 
types. The nine states did not consistently provide information about the rates of participation in 
TQRIS. Only four states—California, Delaware, Minnesota, and Ohio—provided information 
about TQRIS participation over the study period. In those states, participation rates for licensed 
centers (which ranged from about 20 to 70 percent) exceeded those for family child care programs 
(which ranged from less than 3 percent to 25 percent) (see Appendix C for state-specific 
participation rates over the study period). 


States use different measures of quality for family child care programs, compared with 
center-based programs. In addition, family child care programs typically have children with a 
mix of ages because one caregiver is working with a group of children, whereas center-based 
programs often group children by age. 


Defining standards. States define the quality standards that programs must meet to earn 
each rating; those with standards that focus more on classroom-specific aspects of quality than 
on administration and management might find stronger relationships between ratings and 
children’s outcomes than other states (Burchinal et al. 2016; Sabol et al. 2013). Standards can 
cover many different aspects of quality, but some relate more closely to what happens in the 
classroom, such as curriculum and the quality of the learning environment. All nine states 
included in the synthesis have standards in categories related to children’s learning and 
development, though the terminology varies. Other standards pertain more to administration and 
management, which might not affect children’s outcomes as directly. Four of the nine states 
(Delaware, Massachusetts, Ohio, and Oregon) have standards related to administration and 
management, assessing programs on characteristics such as their written operating policies and 
procedures and financial record-keeping (Build Initiative and Child Trends 2016). 


Verifying that programs meet the standards. After defining the quality standards, states 
must collect information to verify that programs meet these standards. States conducting validity 
studies had to identify additional, independent quality measures to use to examine differences 
across ratings. States differ in how they collect the information used to determine whether 
programs meet the standards. Depending on the indicator being verified, states may rely on self- 
reported information from programs, document reviews, and information in existing databases 
(Kirby et al. 2017). All of the states include an observational assessment of the classroom 
environment or teacher—child interactions as part of the rating process for certain rating levels. 


Number of rating levels. States designing a validation study had to consider whether to 
examine differences across all of their individual rating levels or collapse levels into one higher- 
rated group and one lower-rated group. For this study, we also had to determine how to compare 
rating levels across states that used different numbers of levels. Based on meeting the standards, 
states give programs an overall rating that can range from 1 to 4 or 1 to 5. Two states have four 
levels, and the remaining seven have five. 


Rating structure. The rating systems that determine programs’ rating levels could weaken 
the relationship between the ratings and outcomes. TQRIS use one of three rating structures to 
determine a program’s rating level (Figure II.1). Building block structures require programs to 
meet all standards within a level to receive a rating. In contrast, points and hybrid structures 
provide programs the flexibility to choose the standards they meet to earn a higher rating, as long 
as programs receive enough points. In block systems, a program that missed qualifying for the 
next level based on a single standard might not be that different in quality from programs at the 
next level. In points and hybrid structures, programs that receive a given rating level could have 
met different standards. Three states use a building block structure, one state uses a points 
structure, and the other five states use a hybrid structure. 


Figure II.1. Rating structures in TQRIS 

Building block: Programs must meet all standards in a level to receive that rating. 

Points: Programs earn points for meeting standards. The total number of points determines 
the rating. 

Hybrid: Programs are rated by a building-block structure in some levels (usually, lower 
levels) and a points structure in others (usually, higher ones). 
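
As a rough illustration of how the three structures can produce different ratings for the same program, the sketch below rates one hypothetical program under each structure. The standards, point values, and cutoffs are invented for the example and do not correspond to any state's TQRIS.

```python
# Hypothetical illustration only: standards, points, and cutoffs are invented.

# Standards a program must meet at each level under a building block structure.
BLOCK_STANDARDS = {
    1: {"licensed"},
    2: {"licensed", "curriculum"},
    3: {"licensed", "curriculum", "staff_credentials"},
    4: {"licensed", "curriculum", "staff_credentials", "family_engagement"},
}

# Points earned per standard, and the minimum point total for each level.
POINTS = {"licensed": 1, "curriculum": 3, "staff_credentials": 3, "family_engagement": 2}
POINT_CUTOFFS = {1: 1, 2: 3, 3: 6, 4: 9}


def block_rating(met, max_level=4):
    """Highest level reached when every standard in each level must be met."""
    rating = 0
    for level in range(1, max_level + 1):
        if BLOCK_STANDARDS[level] <= met:  # all standards for this level are met
            rating = level
        else:
            break
    return rating


def points_rating(met, levels=(1, 2, 3, 4)):
    """Highest level whose point cutoff is reached, whichever standards earn the points."""
    total = sum(POINTS[standard] for standard in met)
    earned = [level for level in levels if total >= POINT_CUTOFFS[level]]
    return max(earned, default=0)


def hybrid_rating(met):
    """Building blocks for levels 1 and 2, then points for levels 3 and 4."""
    base = block_rating(met, max_level=2)
    if base < 2:
        return base
    return max(base, points_rating(met, levels=(3, 4)))


program = {"licensed", "curriculum", "family_engagement"}
print(block_rating(program), points_rating(program), hybrid_rating(program))  # 2 3 3
```

In this made-up case the program misses level 3 under the block structure because of a single unmet standard, yet it reaches level 3 under the points and hybrid structures by earning enough points from other standards.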


Automatic ratings. Automatic ratings for programs that meet quality standards external to 
TQRIS could weaken the relationship between the ratings and outcomes. Some states 
automatically award high rating levels to programs that meet external quality standards (for 
example, programs accredited by professional organizations, such as the National Association for 
the Education of Young Children, or Head Start programs). Automatic ratings can ease the burden 
of the full data collection and verification process for programs that likely would have met all the 
requirements for the highest rating had they gone through the full process. However, if 
automatically rated programs would not have obtained the highest rating through the full process, 
offering automatic ratings could weaken the relationship between the ratings and outcomes. 
Three of the nine states offer automatic ratings to accredited programs and one state offers 
automatic ratings to Head Start programs. 


Given these differences, it is possible that the relationships between ratings and outcomes 
could differ across states. 
Table II.1. TQRIS characteristics, by state 

California (b) 
  First year of operation: 2012 
  Types of programs that can participate: Voluntary 
  Categories on which programs are rated (a): Child development and school readiness; teachers and teaching; and program and environment 
  Number of rating levels: 5 
  Rating structure: Points 
  Availability of automatic rating: Not available 
  Use of observational measures in rating process: Used for rating, specific score required for points or level 

Delaware 
  First year of operation: 2008 
  Types of programs that can participate: Eligible programs include licensed centers 
  Categories on which programs are rated (a): Family and community partnerships; qualifications and professional environment; management and administration; and learning environment and curriculum 
  Number of rating levels: 5 
  Rating structure: Started as block (2008), changed to points (2012), changed to hybrid (2015) 
  Availability of automatic rating: Available for accredited programs 
  Use of observational measures in rating process: Used for rating, specific score required for points or level 

Massachusetts 
  First year of operation: 2011 
  Types of programs that can participate: Eligible programs include licensed center-based and family child care programs 
  Categories on which programs are rated (a): Curriculum and learning; safe, healthy indoor and outdoor environments; workforce qualifications and professional development; family and community engagement; and leadership administration and management 
  Number of rating levels: 4 
  Rating structure: Block 
  Availability of automatic rating: Not available 
  Use of observational measures in rating process: Used for rating, specific score required for points or level 

Minnesota 
  First year of operation: 2007 
  Types of programs that can participate: Voluntary 
  Categories on which programs are rated (a): Physical health and well-being; teaching and relationships; assessment of children’s progress; and teacher training and education 
  Number of rating levels: 4 
  Rating structure: Hybrid 
  Availability of automatic rating: Available for accredited programs 
  Use of observational measures in rating process: Used for rating, specific score required for points or level 

Ohio 
  First year of operation: 2004 
  Types of programs that can participate: Eligible programs include licensed center-based and family child care programs 
  Categories on which programs are rated (a): Learning and development; administrative and leadership practices; staff qualifications and professional development; family and community partnerships; and staff:child ratio and group size and accreditation 
  Number of rating levels: 5 
  Rating structure: Hybrid 
  Availability of automatic rating: Not available 
  Use of observational measures in rating process: Used for rating, no specific score required 

Oregon 
  First year of operation: 2013 
  Types of programs that can participate: Eligible programs include licensed center-based and family child care programs 
  Categories on which programs are rated (a): Children’s learning and development; health and safety; personnel qualifications; family partnerships; and administrative and business practices 
  Number of rating levels: 5 
  Rating structure: Block 
  Availability of automatic rating: Not available 
  Use of observational measures in rating process: Used for rating, specific score required for points or level 

Rhode Island 
  First year of operation: 2009 
  Types of programs that can participate: Eligible programs include licensed center-based and family child care programs 
  Categories on which programs are rated (a): Learning environment; minimum staff:child ratio; maximum group size; teacher qualifications; program leadership; continuous quality improvement; curriculum; child assessment; inclusive classroom practices; and family communication and involvement 
  Availability of automatic rating: Not available 
  Use of observational measures in rating process: Used for rating, specific score required for points or level 

Wisconsin 
  Types of programs that can participate: Licensed center-based programs and family child care programs can apply 
  Categories on which programs are rated (a): Education and training qualifications; learning environment and curriculum; professional and business practices; and children’s health and well-being practices 
  Rating structure: Hybrid 
  Availability of automatic rating: Available for accredited programs 
  Use of observational measures in rating process: Used for rating, specific score required for points or level 

Source: Build Initiative and Child Trends (2016). 
(a) Categories are listed using state-specific terms, as recorded in the QRIS Compendium. 
(b) California’s TQRIS is administered locally; 16 counties in California first implemented TQRIS in 2012. 
III. COMPONENTS OF VALIDATION STUDIES 


To shape TQRIS that provide meaningful measures of quality, RTT-ELC states conducted 
studies to validate their systems. These studies could validate the TQRIS by “measuring whether 
the tiers of TQRIS accurately reflect different levels of program quality and whether changes in 
quality ratings are related to children’s progress in learning, development, and kindergarten 
readiness” (U.S. Department of Education 2013). The process for a state’s validation of the 
TQRIS included three main components. 


Designing the study and approach to the analysis. First, researchers had to design a study 
that could assess whether the ratings accurately reflect differences in programs’ quality and 
differences in children’s progress in learning and development. Because children could not be 
randomly assigned to early childhood education programs, researchers had to use non- 
experimental designs to compare outcomes for children enrolled in programs with different 
ratings. These comparisons are meaningful only if the children are similar aside from the 
programs in which they enroll. Otherwise, the comparisons will simply reflect differences in 
family backgrounds or existing skills of children who attend programs with different ratings. To 
ensure that the children being compared are as similar as possible, non-experimental designs 
might use matching or statistical controls to adjust for differences between children. However, it 
is still possible that these techniques might miss an important difference between children in 
different programs (such as parent involvement) that is not measured but affects children’s 
outcomes. 
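
A minimal sketch of the statistical-control idea, using made-up data: the raw difference in follow-up scores between children in higher- and lower-rated programs is adjusted for the difference in their baseline scores. The function name, data, and choice of a single baseline covariate are assumptions for illustration; the state studies used their own models and covariates.

```python
from statistics import mean


def adjusted_difference(low, high):
    """Return (raw, adjusted) follow-up differences, high minus low.

    Each group is a list of (baseline, follow_up) pairs. The adjustment uses
    the pooled within-group slope of follow-up on baseline, which gives the
    same estimate as the coefficient on a group indicator in an ordinary
    least squares regression of follow-up on the indicator and baseline.
    """
    def parts(group):
        xs = [x for x, _ in group]
        ys = [y for _, y in group]
        mx, my = mean(xs), mean(ys)
        sxy = sum((x - mx) * (y - my) for x, y in group)
        sxx = sum((x - mx) ** 2 for x in xs)
        return mx, my, sxy, sxx

    mx_l, my_l, sxy_l, sxx_l = parts(low)
    mx_h, my_h, sxy_h, sxx_h = parts(high)
    slope = (sxy_l + sxy_h) / (sxx_l + sxx_h)
    raw = my_h - my_l
    return raw, raw - slope * (mx_h - mx_l)


# Hypothetical children: (baseline score, follow-up score).
# Every child gains exactly 2 points, whatever the program's rating.
low_rated = [(8, 10), (10, 12), (12, 14)]
high_rated = [(12, 14), (14, 16), (16, 18)]

raw, adjusted = adjusted_difference(low_rated, high_rated)
print(raw, adjusted)  # 4 0.0
```

Here the raw 4-point gap comes entirely from children in higher-rated programs starting with higher scores; once baseline is controlled, the gap disappears. An unmeasured difference such as parent involvement would not be removed by this kind of adjustment.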


As part of the study design, researchers also had to decide whether to group programs into 
higher- and lower-rated categories for comparison or to examine differences across each 
individual rating level. Researchers might decide to group programs into higher- and lower- 
rating levels if there were not enough programs at individual rating levels to examine the ratings 
separately, and they might combine rating levels that were similar in quality (such as combining 
levels 1 and 2 and combining levels 3 and 4 in a state with four levels). The decision concerning 
whether to group programs into rating level categories might affect the number of programs they 
chose to recruit in each rating level; researchers could also make this decision during the analysis 
phase, after they knew how many programs they had recruited. 
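
The grouping decision can be sketched as follows for a four-level state, collapsing levels 1 and 2 into a lower group and levels 3 and 4 into a higher group as in the example above. The recruited sample and the minimum of two programs per cell are hypothetical.

```python
from collections import Counter

MIN_PROGRAMS = 2  # hypothetical minimum number of programs needed per cell


def collapse(level, cut=3):
    """Map an individual rating level to a higher- or lower-rated group."""
    return "higher" if level >= cut else "lower"


# Hypothetical rating levels of the programs recruited for a study.
recruited = [1, 2, 2, 3, 3, 3, 4]

per_level = Counter(recruited)
per_group = Counter(collapse(level) for level in recruited)

# Levels with too few programs to analyze on their own.
small_cells = sorted(level for level, n in per_level.items() if n < MIN_PROGRAMS)

print(dict(per_group))  # {'lower': 3, 'higher': 4}
print(small_cells)      # [1, 4]
```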


Recruiting programs and families to obtain a sample of sufficient size and 
representativeness. After finalizing the study design, researchers had to recruit programs and 
families to participate in the validation study. To detect differences in program quality across 
rating levels, researchers had to secure the participation of a sufficient number of programs that 
represented the various rating levels. Researchers also had to determine whether to recruit all 
program types, including family child care programs, or focus on center-based programs because 
of their prevalence. To examine differences in children’s outcomes, researchers also had to recruit 
the children enrolled at the study programs and obtain their parents’ consent to administer the 
assessments. 


Selecting measures, collecting data, and conducting the analysis. In the final stage, 
researchers had to select external measures of program quality and child development. They also 
had to administer the measures and analyze differences in outcomes on the measures across 
rating levels. 
All components had to be completed within time frames allotted for the study. As explained 
earlier, at the time of the validation studies, some states had not fully implemented their TQRIS. 
This timing could affect researchers’ schedules for recruitment and collecting data, as well as 
interpreting findings. 
IV. DATA AND METHODS 


Boxes 2 and 3 discuss the data and methods used in this report. We describe in Box 2 the 
data we collected from the TQRIS validation studies and from our interviews with the 
researchers who conducted them. We explain in Box 3 our methods for synthesizing findings 
across the validation studies.5 


Box 2. Data from state validation studies and interviews with researchers 
This report uses two main sources of data: 
State validation studies 
This report reviewed the following TQRIS validation studies from nine RTT-ELC states: 


California: Quick et al. (2016a, 2016b) 

Delaware: Karoly et al. (2016) 

Massachusetts: Roberts et al. (2016) 

Minnesota: Tout et al. (2016) 

Ohio: Heinemeier et al. (2017) 

Oregon: Lipscomb et al. (2016) 

Rhode Island: Maxwell et al. (2016) 

Washington: Soderberg et al. (2016) 

Wisconsin: Magnuson and YingChun (2015, 2016) 


As we reviewed the validation reports, we documented information about the validation studies (such 
as types and numbers of programs included and methods used). We also recorded the statistics needed to 
calculate differences in program quality and child development between two groups of rating levels: high and 
low. For each group, these statistics include sample sizes, means and standard deviations on baseline and 
follow-up assessments, and regression-adjusted means or regression coefficients. If the reports did not 
include all of this information, we contacted the researchers to request it. We never asked researchers to 
conduct additional analyses. 
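
The recorded sample sizes, means, and standard deviations are enough to express a high-versus-low difference in standard deviation units. The sketch below computes a standardized mean difference with the small-sample correction commonly called Hedges' g. Treating this as the effect-size metric is an assumption for illustration (Appendix B describes the calculations actually used), and the input numbers are made up.

```python
import math


def hedges_g(n_high, mean_high, sd_high, n_low, mean_low, sd_low):
    """Standardized mean difference (high minus low) with small-sample correction."""
    pooled_var = (
        (n_high - 1) * sd_high ** 2 + (n_low - 1) * sd_low ** 2
    ) / (n_high + n_low - 2)
    d = (mean_high - mean_low) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n_high + n_low) - 9)
    return correction * d


# Hypothetical summary statistics for programs in the high and low rating groups.
g = hedges_g(n_high=50, mean_high=105.0, sd_high=10.0,
             n_low=50, mean_low=100.0, sd_low=10.0)
print(round(g, 3))  # 0.496
```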


Interviews with authors 


To understand the challenges experienced by researchers conducting the TQRIS validation studies, we 
conducted a 30-minute semistructured phone interview with principal investigators for each of the nine 
states. 


We first examined validation study reports and existing documentation of TQRIS validation study 
experiences (Lahti et al. 2013) to develop a list of specific challenges principal investigators encountered 
and grouped them into four broad categories: 


1. Study design and analysis 

2. Sample size and representativeness 

3. Selecting program and child measures and collecting data 
4. Study schedule and timing relative to implementing TQRIS 


We then developed a semistructured interview protocol designed to collect standard information across 
the nine studies—whether a specific challenge had been experienced and, if so, whether it was considered 
major or minor—along with more detailed descriptions of all challenges experienced while conducting 
TQRIS validation studies. 


⁵ Our analysis approach follows the What Works Clearinghouse (WWC) standards version 3.0 (U.S. Department of 
Education 2014). The WWC released version 4.0 of their standards in October 2017 after we had collected data 
from author queries and conducted our analyses. See Appendix B for details on the differences between the two 
versions as they relate to our analysis. 

Box 3. Methods for synthesizing results of validation studies 

This box summarizes how we synthesized results of validation studies. For more details about these methods, 
see Appendix B. 

The nine states had different numbers of rating levels and used different methods and outcome measures in 
their validation studies. To combine findings across these states, we did the following: 

1. Defined two groups of rating levels: high and low. To compare findings across states that had different 
numbers of rating levels (either four or five), and across states that had already combined rating levels into two 
groups and those that had not, we defined two groups of consistent rating levels. These groups followed the 
definitions used by the states that already combined rating levels: 

• High (a rating level of 3 or higher) 
• Low (a rating level of 1 or 2) 


2. Classified outcome measures into domains. To compare findings across states that used different 
assessments, we classified outcome measures into domains based on the constructs they assessed. For 
example, the language development domain included children’s scores on Letter-Word Identification on the 
Woodcock Johnson Tests of Achievement and the Test of Preschool Early Literacy. 


3. For each outcome measure, calculated the average difference between programs with high ratings and 
those with low ratings in each state. We used different methods to calculate the standardized average 
differences for the program quality and child outcome measures. States also used different methods for each 
type of measure. For child outcome measures, measuring them at the beginning of the year (baseline) enabled 
researchers to try to account for differences between the types of children who attend programs with different 
ratings. 


a. For program quality measures, we calculated the standardized difference. For program quality 
measures, we took the average difference in program quality between higher- and lower-rated programs, 
and divided it by the standard deviation of this measure (calculated across all programs). 


b. For child outcome measures, used estimates based only on similar children. For each child 
development measure, we examined whether children had similar scores on the baseline assessments. 
We did not use outcomes if children had baseline differences that exceeded the What Works 
Clearinghouse (WWC) baseline equivalence standards. Among estimates that met these WWC thresholds, 
we used regression-adjusted estimates to calculate the standardized average difference. If those were not 
available, we subtracted any differences between higher- and lower-rated programs at the beginning of the 
year from the differences at the end of the year. In both cases, we standardized the average difference by 
dividing by the standard deviation. 


4. Characterized the statistical significance, sign, and magnitudes of findings. We reported: 

• For each state, the difference between higher- and lower-rated programs, averaged across all measures 
in the domain, and its associated 95 percent confidence interval. This average summarizes findings within 
states. The confidence interval indicates a significant association if its bounds do not include zero. We 
followed WWC procedures to calculate these averages and confidence intervals. 

• Across states, the average difference between higher- and lower-rated programs and its associated 95 
percent confidence interval. This average summarizes findings across states. Because the quality of programs in 
each rating level could differ dramatically across states, we calculated this average using a statistical 
approach that weighted each state by a measure of the confidence in the average (a fixed-effect 
meta-analysis that weights states by their inverse variance). 


5. Conducted a supplemental analysis that compared only the highest and lowest rating levels possible. 
As a supplemental analysis, for states that reported disaggregated statistics by rating level, we repeated steps 
1 through 4 to compare only the highest and lowest rating levels. This analysis provides information about 
differences in outcomes for the largest contrast of rating levels. 
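
The across-state average described in this box (a fixed-effect meta-analysis with inverse-variance weights) can be illustrated with a minimal Python sketch. The function name and the input values are hypothetical and chosen only for illustration; they are not statistics from any state's validation report, and the sketch does not reproduce the full WWC procedures that the synthesis followed.

```python
import math

def fixed_effect_average(effects, std_errors, z=1.96):
    """Inverse-variance weighted average of effect sizes with a 95 percent CI."""
    # Each state's weight is the inverse of its variance (1 / SE^2).
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total_weight
    # Standard error of the weighted average under the fixed-effect model.
    se_mean = math.sqrt(1.0 / total_weight)
    return mean, (mean - z * se_mean, mean + z * se_mean)

# Hypothetical state-level effect sizes and standard errors (illustration only).
effects = [0.30, 0.60, 0.90]
std_errors = [0.10, 0.20, 0.20]

mean, (low, high) = fixed_effect_average(effects, std_errors)
# The association is significant if the interval excludes zero.
significant = low > 0 or high < 0
print(f"average = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f}), significant = {significant}")
```

In this illustration the first hypothetical state has the smallest standard error, so it receives the most weight and pulls the average toward its own estimate.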

V. APPROACHES FOR STATES’ VALIDATION STUDIES 


All nine states included in this synthesis used a non-experimental design for their validation 
studies. However, their studies included different types of programs and used different measures 
and methods (Table V.1). We classified the measures into domains based on the constructs they 
assessed. These domains include program quality or separate types of child outcomes, such as 
alphabetics or social-emotional development. (See Appendix A, Table A.1 for more details on 
the definitions of the domains and specific measures categorized in each domain and Appendix 
A, Table A.2 for the specific outcome measures used for the TQRIS validation study in each 
state.) See Appendix C for a detailed description of each state’s validation study and its findings. 
The descriptions in Appendix C include information that was reported systematically in all 
studies such as the number and types of programs in the sample (and, if available, how the 
sample of programs in the validation studies compare with the full sample of programs that 
participate in TQRIS or the full sample of programs in the state), ages of children in those 
programs, characteristics of the sample, and the statistical approach the authors took. For 
information not included in Appendix C, please refer to the individual state validation studies 
listed in Box 2 and cited in the “References” section. 


Most states included both center-based and family child care programs, but the 
number of programs included varied across states. Six states included both center-based and 
family child care programs. Only three states (California, Massachusetts, and Rhode Island) 
focused entirely on center-based programs. The number of programs included in the studies 
ranged from about 70 (Ohio and Rhode Island) to about 300 (Minnesota and Oregon). 


Three states combined center-based and family child care programs in their analyses. States 
use different measures of program quality for center-based and family child care programs (for 
example, they might use the Early Childhood Environment Rating Scale-Revised for center-based 
programs and the Family Child Care Environment Rating Scale for family child care 
programs). However, the states that combined center-based and family child care programs in their 
analyses argued that it was appropriate to group programs together when assessing validity of the 
TQRIS ratings. 


Eight of the nine states used an independent measure of program quality to examine 
differences between programs, but they used different methods to compare programs. Only 
Washington did not use an independent measure of program quality to examine differences 
across rating levels, opting to focus solely on differences in children’s outcomes across rating 
levels. 


Researchers most commonly observed the learning environment in a subset of classrooms 
within programs once during the study period.⁶ Drawing upon widely used rubrics, such as the 
Classroom Assessment Scoring System (CLASS) and the Environment Rating Scales (ERS), 
they rated interactions in the learning environment such as those between staff and children; 
among staff, parents, and other adults; between children; and between children and the materials 
and activities in the classroom. (See Appendix A, Table A.3 for more information about the 
scores and predictive validity of the rubrics.) For some rubrics, researchers also rated physical 
components of the learning environment, including the space, schedule, and materials that 
support interactions in the classroom. 


⁶ For some programs, all classrooms in the program were observed because, for example, the study 
observed up to four classrooms per program and the program had four or fewer classrooms. 


Most states used one to three measures of program quality, but Ohio used seven. In some 
cases, the external observations were coded using the same rubrics (such as CLASS or ERS) as 
the observations collected for the TQRIS ratings. Four states’ validation studies used the same 
rubric for at least one external measure, and two states’ studies used a different rubric. The other 
two states’ studies used the same rubric as the TQRIS, but argued that, in practice, none or very 
few of the programs in their studies had actually been rated on that rubric by the state (for 
example, because the state did not require observations for programs rated below the highest 
level). We might expect to find stronger relationships between ratings and measures of program 
quality that states use in their TQRIS, compared with those they do not. 


Most states compared program quality outcomes for individual rating levels and did not 
adjust for other program characteristics. Five of the eight states that used an independent 
measure of quality compared programs in each rating level to programs in every other level.⁷ 
The other three states combined programs in individual rating levels into two groups with higher 
and lower ratings for comparison. Two states controlled for other program characteristics in their 
comparisons of program quality by rating level.⁸ The remaining six states simply compared 
average scores without conducting any statistical adjustments. 


Eight states used at least one assessment of children’s outcomes, but their methods for 
comparing programs differed. Only Oregon had not reported findings for children’s outcomes 
at the time that we conducted this synthesis. In the other eight states, researchers collected 2 to 
12 measures of children’s outcomes; for each, they administered two assessments to children 
(once at baseline and again at a follow-up period).⁹ The time between the baseline and follow-up 
period was less than 12 months for all states. Researchers most frequently assessed children’s 
alphabetics, cognition, and mathematics skills, but a few states collected data on children’s 
socioemotional development and motor skills. The assessments included the Woodcock Johnson 
II Letter-Word identification subscale, the Woodcock-Johnson III applied problems subscale, 
and the Preschool Learning Behaviors Scale. 


Most states compared children’s outcomes for individual rating levels and adjusted for 
children’s performance at baseline. Only two states combined programs in individual rating 
levels into two groups for comparison. Six of the states accounted for children’s performance at 
the beginning of the study when comparing their outcomes at the end of the study across rating 
levels. The specific child and family characteristics that researchers included as statistical 
controls in analyses varied widely across states; no two validation studies included the same set 
of covariates in their analyses. Many states (six of eight that analyzed children’s outcomes) used 
a statistical approach (multilevel modeling) to account for the shared experiences of children 
within programs. 


⁷ Delaware combined the Starting with Stars and 2-star rating levels but otherwise compared individual rating 
levels. 


⁸ For comparability across states, we used findings without controls in our analyses, although it is possible that 
including control variables could reduce the size of the differences between higher- and lower-rated programs. See 
the note in Figure VI.1 for details on the two states that reported findings with controls. 


⁹ One state attempted to collect child development outcomes from all 3- and 4-year-old children in each site. Among 
the remaining states collecting child development outcomes, four states collected outcomes from a sample of 
children within programs, and four states collected outcomes from a sample of children within a single selected 
classroom or a subset of classrooms within each program. 


To examine whether attending a higher-rated program was beneficial for children from low- 
income households, four states separately analyzed outcomes of low-income children. High 
quality programs might be especially important for these children with fewer resources and 
opportunities to promote their development. 

Table V.1. Summary of approaches states used to examine measures of program quality and children’s 
outcomes 

Column headings, in order: 
State 
Number and type of programs 
Program quality: Number of external measures of program quality 
Program quality: Analysis combined rating levels into two groups 
Program quality: Controls for program characteristics 
Children’s outcomes: Number of outcome measures 
Children’s outcomes: Analysis combined rating levels into two groups 
Children’s outcomes: Controls for baseline performance 
Children’s outcomes: Controls for other child and family characteristics 
Children’s outcomes: Time period for measuring change in child development 
Children’s outcomes: Supplemental subgroup analysis for low-income children 

California      166 center-based programs                          1  x  4  x  x  Fall to spring 
Delaware        156 center-based and family child care programs    3  x  5  x  x  Fall to spring  x 
Massachusetts   120 center-based programs                          3  x  5  x  x  Fall to spring 
Minnesota       294 center-based and family child care programs    3  6  x  x  Fall to spring  x 
Ohio            72 center-based and family child care programs     7  2  Spring to fall 
Oregon          304 center-based and family child care programs    1  x  0  n.a.  n.a.  n.a.  n.a.  n.a. 
Rhode Island    71 center-based programs                           1  5  x  x  Fall to spring  x 
Washington      100 center-based and family child care programs    0  n.a.  n.a.  12  x  x  Fall to spring 
Wisconsin       239 center-based and family child care programs    2  x  7  x  x  x  Fall to spring  x 

Sources: State validation reports for California, Delaware, Massachusetts, Minnesota, Ohio, Oregon, Rhode Island, Washington, and Wisconsin. 


n.a. = not applicable. 

VI. ASSOCIATIONS BETWEEN RATINGS AND INDEPENDENT MEASURES OF 
PROGRAM QUALITY 


If the ratings accurately reflect different levels of program quality, higher-rated programs 
should score better than lower-rated ones on quality measures collected outside of the system. 
Researchers from eight states tested the validity of the ratings by examining their relationship 
with independent assessments of the classroom environment and teacher—child interactions, such 
as the ERS and CLASS. Our analysis compared the outcomes of higher-rated programs 
(programs with rating levels of 3 or above) with those of lower-rated programs (programs in the 
bottom two rating levels). 


Higher-rated programs scored higher on independent assessments of program quality 
than lower-rated programs. Across all states, higher-rated programs scored 0.57 standard 
deviations higher on measures of program quality, on average, than lower-rated programs 
(shown by the blue dot in Figure VI.1). This finding across states was statistically significant 
(shown by the solid 95 percent confidence interval around the blue dot in Figure VI.1 that does 
not include zero). 


In seven of eight states, higher-rated programs had significantly better scores on measures of 
program quality than lower-rated programs (shown by the solid 95 percent confidence intervals 
around the black dots in Figure VI.1 that do not include zero). In the other state (Minnesota), the 
average difference between higher- and lower-rated programs was positive but not significant 
(shown by the dashed 95 percent confidence interval around the black dot in Figure VI.1 that 
includes zero). Across all of the program quality measures used by each state, the range of scores 
was about 1.00 standard deviation; higher-rated programs scored an average of 0.20 standard 
deviations higher than lower-rated programs in Minnesota, and an average of 1.19 standard 
deviations higher in California (as shown by the black dots in Figure VI.1). The average 
difference in program quality varied significantly across states (based on a statistical test for 
heterogeneity called the “Q test”). 
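
As a rough illustration of the heterogeneity test mentioned above, the sketch below computes Cochran's Q statistic and its p-value from state-level effect sizes and standard errors. The inputs are hypothetical values chosen for illustration, not the states' actual estimates. The helper `chi2_sf` is a hand-written upper-tail chi-square probability for integer degrees of freedom, included only to keep the example free of third-party packages.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    half_x = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-half_x)              # Q(1, x/2)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(half_x))   # Q(1/2, x/2)
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    while a < df / 2.0:
        tail += half_x ** a * math.exp(-half_x) / math.gamma(a + 1.0)
        a += 1.0
    return tail

def q_test(effects, std_errors):
    """Cochran's Q: weighted squared deviations from the fixed-effect average."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    return q, df, chi2_sf(q, df)

# Hypothetical state-level effect sizes and standard errors (illustration only).
effects = [0.30, 0.60, 0.90]
std_errors = [0.10, 0.20, 0.20]

q, df, p_value = q_test(effects, std_errors)
print(f"Q = {q:.3f}, df = {df}, p = {p_value:.3f}")
```

A small p-value indicates that the state estimates differ from one another by more than sampling error alone would suggest.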


The differences in program quality scores between higher- and lower-rated programs 
might not be large enough to significantly increase children’s learning. Based on the 
standard deviations of the measures that states reported, the average difference in program 
quality across states (0.57 standard deviations) corresponds to a difference of only about half a 
point on the ERS or CLASS. The largest difference (1.19 standard deviations in California) 
corresponds to a difference of about one point on these measures—small enough to leave higher- 
and lower-rated programs in the same classification of scores. 
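
The two conversions in this paragraph can be checked against each other with simple arithmetic. The sketch below assumes that a single common standard deviation (in scale points) applies to both figures; the report does not state that value, so the implied standard deviation is a back-of-the-envelope inference rather than a reported statistic.

```python
# Figures quoted in the text.
average_diff_sd = 0.57      # average difference across states, in standard deviations
largest_diff_sd = 1.19      # largest difference (California), in standard deviations
average_diff_points = 0.5   # "about half a point" on the ERS or CLASS

# Assumption: one common standard deviation, in scale points, underlies both figures.
implied_sd_points = average_diff_points / average_diff_sd
largest_diff_points = largest_diff_sd * implied_sd_points

print(f"implied standard deviation: {implied_sd_points:.2f} points")
print(f"largest difference: {largest_diff_points:.2f} points")
```

Under that assumption, the implied standard deviation is a little under one point, and the largest difference works out to roughly one point, which is consistent with the text.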


This program quality difference might not be large enough to significantly increase 
children’s learning. For example, two recent meta-analyses found few associations between these 
measures and children’s outcomes (Perlman et al. 2016, Brunsek et al. 2017). In addition, 
significant positive associations that studies do find tend to be modest—that is, effect sizes that 
roughly correspond to less than 0.10 standard deviations or 1.4 weeks of learning (Howes et al. 
2008, Aikens et al. 2017). 


The overall level of quality for higher-rated programs could not be described as high 
based on the program quality scores. For most states that used ERS as their measure of 
program quality, the average scores for both higher- and lower-rated programs fell in the 
minimal range (scores of 3 or 4). This is considered better than inadequate (scores of 1 or 2), but 
is still worse than good (scores of 5 or 6) or excellent (a score of 7) (Harms et al. 2015). On the 
CLASS, in all but one of the states, both groups of programs had average scores for emotional 
support in the medium quality range (scores of 3 to 5). Both groups of programs had average 
scores in the low to medium (scores of 1-5) range on the instructional support scale (Pianta et al. 
2008). These scores fall below the high quality range (scores of 6 or 7) on the CLASS. 


Figure VI.1. Difference in program quality between higher- and lower-rated 
programs 


[Figure: each state’s difference in program quality, with its 95 percent confidence interval, plotted on a 
horizontal axis running from 0.0 to 1.6. Rows, from top to bottom: California, Rhode Island, Wisconsin, Ohio, 
Massachusetts, Average, Delaware, Oregon, Minnesota. Horizontal axis label: Difference in program quality 
between higher- and lower-rated programs.ᵃ Legend: dashed line = not significant; solid line = significant.] 

Sources: State validation reports for California, Delaware, Massachusetts, Minnesota, Ohio, Oregon, Rhode Island, 
and Wisconsin. 


Note: Washington is not shown in the figure because its validation report did not include analyses of program 
quality measures. Each black dot represents a state’s difference in program quality (measured in standard 
deviations) for higher- versus lower-rated programs. Positive differences mean that higher-rated programs 
performed better than lower-rated programs. The blue dot represents the average difference across states. 
The lines on either side of the dots denote the 95 percent confidence interval. Two states calculated group 
means controlling for region (Wisconsin) and a variety of provider- and neighborhood-level characteristics, 
such as Title I status and the distribution of race in the provider's zip code (Delaware). The standardized 
mean differences using these regression-adjusted means are 0.64 standard deviations in Wisconsin 
(compared with 0.62 in the figure), and 0.26 standard deviations in Delaware (compared with 0.56 in the 
figure). The p-value for the Q test for heterogeneity is 0.01. 


ᵃ The difference in program quality is calculated as the standardized mean difference, which is the difference in 
unadjusted means between higher- and lower-rated programs divided by the pooled standard deviation. The average 
is calculated using a statistical approach that weights each state’s standardized mean difference by a measure of its 
confidence (a fixed-effect meta-analysis that weights states by their inverse variance). 
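
The standardized mean difference described in the note above can be sketched in a few lines of Python, assuming each state reports group means, standard deviations, and sample sizes. The function name and the input values are hypothetical. The standard-error formula is a common large-sample approximation and is an assumption here; it is not necessarily the exact computation used in the synthesis.

```python
import math

def standardized_mean_difference(mean_hi, sd_hi, n_hi, mean_lo, sd_lo, n_lo):
    """Difference in unadjusted means divided by the pooled standard deviation."""
    pooled_var = ((n_hi - 1) * sd_hi ** 2 + (n_lo - 1) * sd_lo ** 2) / (n_hi + n_lo - 2)
    d = (mean_hi - mean_lo) / math.sqrt(pooled_var)
    # Assumed large-sample approximation to the variance of d.
    var_d = (n_hi + n_lo) / (n_hi * n_lo) + d ** 2 / (2 * (n_hi + n_lo))
    return d, math.sqrt(var_d)

# Hypothetical observation scores for higher- and lower-rated programs (illustration only).
d, se = standardized_mean_difference(mean_hi=4.5, sd_hi=1.0, n_hi=50,
                                     mean_lo=4.0, sd_lo=1.0, n_lo=50)
ci = (d - 1.96 * se, d + 1.96 * se)
print(f"d = {d:.2f}, SE = {se:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The squared standard error from this step is the variance whose inverse serves as each state's weight in the fixed-effect average.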

VII. ASSOCIATIONS BETWEEN RATINGS AND CHILDREN’S DEVELOPMENT 
OUTCOMES 


If programs with higher ratings are better at promoting children’s development, children in 
higher-rated programs should have better outcomes than children in lower-rated programs. Eight 
states tested the validity of their TQRIS by examining the relationship between the ratings and 
measures of children’s outcomes in domains that ranged from alphabetics to social-emotional 
development. Our analysis includes findings only when children had similar scores on child 
development measures at the beginning of the study, despite attending programs with different 
ratings; these findings are less likely to reflect differences between the types of children who 
attend different programs.¹⁰ 
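
The baseline screen, and the fallback adjustment used when regression-adjusted estimates were unavailable, can be sketched as follows. The cutoffs of 0.05 and 0.25 standard deviations are an assumption based on the WWC version 3.0 baseline equivalence standard as it is commonly summarized, and should be checked against that handbook. The function names and input values are hypothetical.

```python
def baseline_status(baseline_diff_sd):
    """Classify a standardized baseline difference (assumed WWC version 3.0 cutoffs)."""
    size = abs(baseline_diff_sd)
    if size > 0.25:
        return "excluded"                  # groups too different at baseline
    if size > 0.05:
        return "included with adjustment"  # requires statistical adjustment
    return "included"

def difference_in_differences_effect(followup_diff, baseline_diff, sd):
    """Follow-up gap minus baseline gap, standardized by the standard deviation."""
    return (followup_diff - baseline_diff) / sd

# Hypothetical raw-score gaps between higher- and lower-rated programs (illustration only).
baseline_diff = 1.0   # gap at the beginning of the year, in raw score points
followup_diff = 3.0   # gap at the end of the year, in raw score points
sd = 10.0             # standard deviation of the measure, in raw score points

status = baseline_status(baseline_diff / sd)
effect = difference_in_differences_effect(followup_diff, baseline_diff, sd)
print(status, round(effect, 2))
```

Subtracting the baseline gap keeps a pre-existing difference between the two groups of children from being counted as a difference associated with the ratings.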


Outcomes for all children 


In general, children attending higher-rated programs did not have better 
developmental outcomes than children attending lower-rated ones. Across the six child 
outcome domains that had findings for multiple states, children in higher-rated programs did not 
consistently score better, on average, than those in lower-rated programs across states (shown by 
the dashed 95 percent confidence intervals around the blue dots in Figure VII.1 that include 
zero). The average across states was significantly positive in only one domain: comprehension 
(shown by the solid 95 percent confidence interval around the blue dot in Figure VII.1 that does 
not include zero). In comprehension, children in higher-rated programs scored 0.18 standard 
deviations higher than children in lower-rated programs. Without regard to statistical 
significance, across domains, the average differences across states (shown by the blue dots in 
Figure VII.1) were both positive and negative. 


Within each child outcome domain, few states saw significantly better scores for children in 
higher-rated programs, compared with those in lower-rated ones. For six child outcome domains, 
no states saw significantly better scores for children in higher-rated programs (as signified by the 
dashed 95 percent confidence intervals around the black dots in Figure VII.1 that include zero). 
For three domains—comprehension, cognition, and motor skills—only one state in each domain 
saw significantly higher scores for children in higher-rated programs (as signified by the solid 95 
percent confidence intervals around the black dots in Figure VII.1 that do not include zero). The 
other states that analyzed comprehension and cognition did not find significant differences 
between children attending programs with different rating levels. No other state in the analysis 
assessed children’s motor skills. Among domains that had findings for two or more states, 
findings did not vary significantly across states (based on a statistical test for heterogeneity 
called the “Q test”). 


¹⁰ Sixty-nine percent (25 of 36) of the child outcome domains that states examined were based on children with 
similar scores at the beginning of the study and, thus, included in this analysis. 

Figure VII.1. Differences in child development outcomes between children attending higher- and 
lower-rated programs, by domain 


Sources: State validation reports for California, Delaware, Massachusetts, Minnesota, Ohio, Rhode Island, 
Washington, and Wisconsin. 

Notes: Each black dot represents a state’s effect size (measured in standard deviations) for higher- versus 
lower-rated programs. Positive effect sizes mean that children in higher-rated programs performed better than 
children in lower-rated programs. Each blue dot represents the average effect size across states. The lines 
on either side of the dots denote the 95 percent confidence interval. The p-value for the Q test for 
heterogeneity is 0.91 for alphabetics; 0.18 for comprehension; 0.43 for general reading achievement; 0.10 
for cognition; 0.26 for mathematics; and 0.62 for social-emotional development. 


ᵃ The difference in children’s outcomes is an effect size (measured in standard deviations), which is calculated based 
on author-provided sample sizes, means, standard deviations, and other regression statistics. The average is 
calculated using a statistical approach that weights each state’s effect size by a measure of its confidence (a 
fixed-effect meta-analysis that weights states by their inverse variance). 

Most states did not consistently see higher scores for children in higher-rated 
programs, compared with those in lower-rated ones. Only Delaware and Washington saw 
significantly better scores for children attending higher-rated programs in at least one domain 
(Figure VII.2). Delaware found one statistically significant result in cognition but no significant 
result in social-emotional development. In Washington, only two outcomes across seven 
domains were statistically significant. No other states found any significant differences in 
outcomes. On the whole, there is no consistent evidence that children attending higher-rated 
programs scored higher on measures of child development, either across states or in any one 
state. 


One hypothesis for why child outcomes did not differ for children attending higher- and 
lower-rated programs is that the two groups of programs might not have had sufficient 
differences in quality to lead to differences in children’s learning. One might expect to see 
greater differences in children’s outcomes if just the highest and lowest rating levels were 
compared. 


To examine this hypothesis, we conducted a supplemental analysis that used findings from 
the highest and lowest individual rating levels (or groups of rating levels) that states reported. 
For example, if the state reported findings for five individual rating levels, the supplemental 
analysis compared level 1 (lowest) to level 5 (highest). To reduce burden on authors, we did not 
ask them to provide additional information for this analysis, so we used the statistics they 
reported to create the largest possible contrast in ratings (see Appendix B, Table B.1 for the 
comparisons used in each state). This analysis included the six states that reported information 
separately for each rating level, and excluded the two states that did not (because it was not 
possible to compare the highest and lowest rating levels in those states). 


Even when comparing children from the highest- and lowest-rated programs, children 
attending higher-rated programs did not have better developmental outcomes than 
children attending lower-rated ones. For domains where there were enough states to perform 
the supplemental analysis, the estimated differences in outcomes between higher- and lower- 
rated programs increased slightly, but none were statistically significant (see Appendix B, Table 
B.2 for the results). Overall, the main analysis and this supplemental analysis provide no 
evidence that children attending higher-rated programs scored higher on measures of child 
development than children attending lower-rated programs. 


Outcomes for low-income children 


In addition to assessing the relationship between rating levels and child development 
outcomes overall, many states examined whether attending a higher-rated program was 
beneficial for children from low-income households.¹¹ High quality early childhood programs 
might be especially important for these children with fewer resources and opportunities to 
promote their development. 


¹¹ The definitions of low income varied across the states. See Appendix C for state-specific definitions. 

Figure VII.2. Difference in child development outcomes between children 
attending higher- and lower-rated programs, by state 

Sources: State validation reports for California, Delaware, Massachusetts, Minnesota, Ohio, Rhode Island, 
Washington, and Wisconsin. 


Note: Each black dot represents a state’s effect size (measured in standard deviations) for higher- versus lower- 
rated programs. Positive effect sizes mean that children in higher-rated programs performed better than 
children in lower-rated programs. Each blue dot represents the average effect size across states. The lines 
on either side of the dots denote the 95 percent confidence interval. 


ᵃ The difference in children’s outcomes is an effect size (measured in standard deviations), which is calculated based 
on author-provided sample sizes, means, standard deviations, and other regression statistics. 

Ratings of TQRIS also were not associated with child development outcomes for low-income 
children. In the five domains in which states analyzed outcomes separately for low-income 
children (alphabetics, comprehension, cognition, mathematics, and social-emotional 
development), none of the states found significant differences between the scores of low-income 
children in higher- and lower-rated programs on any outcome (Figure VII.3). 


Figure VII.3. Differences in child development outcomes between low-income 
children attending higher- and lower-rated programs, by domain 

Sources: State validation reports for Delaware, Minnesota, and Rhode Island. 


Note: Each black dot represents a state’s effect size (measured in standard deviations) for higher- versus lower- 
rated programs. Positive effect sizes mean that children in higher-rated programs performed better than 
children in lower-rated programs. Each blue dot represents the average effect size across states. The lines 
on either side of the dots denote the 95 percent confidence interval. 


ᵃ The difference in children’s outcomes is an effect size (measured in standard deviations), which is calculated based 
on author-provided sample sizes, means, standard deviations, and other regression statistics. The average is 
calculated using a statistical approach that weights each state’s effect size by a measure of its confidence (a 
fixed-effect meta-analysis that weights states by their inverse variance). 

Potential explanations for findings for all children and low-income children 


Overall, there is no consistent evidence that children who participated in programs with 
higher rating levels had better outcomes, overall or specifically among low-income children. 
There are several potential explanations for this finding. 


It is unlikely that there would have been larger differences in outcomes if all of the 
validation studies had compared only children in programs with the lowest rating level to 
children in programs with the highest rating level. Such an analysis would provide the largest 
differences in program quality with which to best assess relationships with developmental 
outcomes. We conducted this analysis to the extent possible with the subset of states that 
compared rating levels individually, and we found no statistically significant differences across 
states in any domain. In the validation studies for the seven states that did compare rating levels 
individually (California, Delaware, Massachusetts, Ohio, Oregon, Rhode Island, and 
Washington), there were either no statistically significant differences or sporadic differences 
(most of which were in the hypothesized direction but some of which were not). 


It is unlikely that the amount of time between baseline and follow-up assessments was 
insufficient for detecting changes in children’s skills. It is possible that a longer follow-up 
period could have resulted in larger differences in children’s outcomes. Most states conducted 
baseline assessments in the fall and follow-up assessments in the spring, so there was a range of 
about one to nine months between assessments. Nearly all states showed that children made age- 
appropriate gains between baseline and follow-up on the child development measures they 
collected and analyzed for their studies. Therefore, the time between assessments was sufficient 
to detect gains, but those gains were not greater for children in higher-rated programs. 


The differences in program quality between higher- and lower-rated programs might 
not be large enough to generate differences in child development outcomes. Across states, 
the average difference in program quality corresponded to roughly half a point on the ERS or 
CLASS, which might not be large enough to cause differences in children’s development. For 
example, one study found that a point difference on the ERS or CLASS emotional support score 
was not associated with differences in children’s vocabulary, alphabetics, or math skills, but a 
point difference on CLASS instructional support was (Mashburn et al. 2008). 


The states that saw the largest differences in program quality (California and Rhode Island) 
did not see significant differences in any of the child outcome domains. Washington found the 
greatest number of significant differences in children’s outcomes, but did not analyze an external 
measure of program quality. 


The states’ validation studies also did not find strong evidence of an association between the 
program quality measures and children’s outcomes. All five states that reported on these 
associations found few or no positive associations. The findings are consistent with previous 
literature that finds, at best, modest positive associations (that is, effect sizes of less than 0.10 
standard deviations) between program quality measures and children’s outcomes (Howes et al. 
2008; Perlman et al. 2016; Brunsek et al. 2017). 


The average levels of quality among programs in TQRIS were not high enough to 
affect children’s outcomes. As previously mentioned, the ERS and CLASS scores of high- and 
low-rated programs were not in the range described as high quality by the publishers of these 
measures. Recent studies suggest that there might be a particular threshold at which differences 
in quality result in differences in children’s outcomes (Burchinal et al. 2016; Weiland et al. 
2013). 


The ways that states calculate and award ratings could weaken the relationship 
between these ratings and children’s development. Using a large number of indicators to 
determine the ratings (including some that are not closely related to children’s classroom 
experiences) might contribute to a lack of correspondence between these ratings and children’s 
development. Previous studies have found that more specific measures of quality have stronger 
relationships with children’s outcomes than broader measures do. For example, one study found that 
measures of the quality of instruction in literacy or math or of teacher-child interactions were 
stronger predictors of children’s outcomes than broader measures of the quality of the classroom 
environment (Burchinal et al. 2016). Another study found that measures of the quality of 
teacher-child interactions had stronger relationships to children’s outcomes than simulated 
ratings (Sabol et al. 2013). In fact, only a few of the components of the ratings (such as 
child:staff ratios and group size, curriculum, staff qualifications, and quality of the environment) 
have been associated with children’s outcomes (Kirby et al. 2017). 


The rating structures and state policies, such as automatic ratings, might also weaken the 
relationship between ratings and children’s development. In block rating structures, programs 
that just miss qualifying for the next rating level might not have sufficiently different child 
outcomes from programs that barely achieved the next rating level. In hybrid and points rating 
structures, programs could reach rating levels in different ways, and not all of these might be 
strongly associated with children’s development. Automatic ratings could also weaken the 
relationship between ratings and children’s outcomes if programs that automatically received top ratings 
would not have received those ratings through the full data collection and verification process. 


The analyses might have lacked sufficient statistical power to detect differences in 
outcomes between higher- and lower-rated programs. Having a larger number of programs 
could improve the analyses’ ability to detect statistically significant differences between 
programs. However, this does not explain the patterns of small differences in some domains and 
negative differences in others. Some states also found a few statistically significant differences 
with the number of programs in their analyses, but other states’ analyses could have lacked 
sufficient statistical power. 
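
As a rough illustration of why the number of programs matters for statistical power, the Python sketch below computes an approximate minimum detectable effect size for a simple two-group comparison of program-level means. This is a simplification and not a calculation from the state studies: it uses a normal approximation, ignores the nesting of children within programs and any covariates, and the function name, default significance level (0.05), default power (0.80), and sample sizes are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(n_high, n_low, alpha=0.05, power=0.80):
    """Approximate smallest standardized difference detectable at the given power.

    n_high, n_low -- number of higher- and lower-rated programs (hypothetical)
    Uses a two-sided test and a normal approximation; clustering of
    children within programs is deliberately ignored in this sketch.
    """
    standard_normal = NormalDist()
    multiplier = (standard_normal.inv_cdf(1 - alpha / 2)
                  + standard_normal.inv_cdf(power))
    return multiplier * math.sqrt(1.0 / n_high + 1.0 / n_low)


# Hypothetical sample sizes: quadrupling the number of programs
# halves the smallest difference the comparison can reliably detect.
for n in (30, 120):
    print(n, round(minimum_detectable_effect(n, n), 2))
```

Under these assumptions, a study with a few dozen programs per group could reliably detect only large differences, which is consistent with the concern that some states' analyses lacked sufficient power.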
VIII. CHALLENGES IN CONDUCTING VALIDATION STUDIES 


The researchers who conducted the validation studies reported challenges in four broad 
areas: study design, sample size, measure selection and data collection, and the schedule and 
timing of the study. We discuss common issues that at least five researchers perceived as either a 
major or minor challenge. See Appendix D for complete findings on all of the challenges that we 
discussed with researchers in the interviews. 


All nine researchers noted at least one limitation in their validation study design or 
analysis approach that could affect the interpretation of findings. Researchers often cited 
having a non-experimental design (in which families choose child care programs rather than 
being randomly assigned to them) as a limitation (Table VIII.1). In non-experimental designs, 
differences between the outcomes of children in higher- versus lower-rated programs might 
reflect differences between the characteristics of families that choose these programs, rather than 
true differences due to programs’ quality. Ultimately, few researchers (three of nine) perceived 
the design limitations as major; most described them as only minor challenges requiring a careful 
approach to developing research questions and interpreting findings. 


Table VIII.1. Common challenges reported by researchers who conducted 
validation studies 


                                                      Number of states            Number of states
                                                      categorizing as a           categorizing as a
Challenge                                             challenge (major or minor)  major challenge

Design limited the interpretation of findings         9                           3
Deciding which rating levels to validate              5                           1
Recruiting programs                                   8                           4
Attaining sufficient representation across
  program types and rating levels                     8                           5
Recruiting children
Low response rates from programs
Selecting measures of child development               7
Missing data for programs or children                 7
Analyzing administrative data (a)                     6
Analyzing measures of child development               5
Obtaining administrative data (a)                     5
Selecting measures of program quality                 5
Limited data on family/child characteristics          5
Conducting study before TQRIS were fully
  implemented (b)
Collecting data in the allotted study time frame      6                           1


Sources: Interviews with principal investigators who conducted the state validation reports in California, Delaware, 
Massachusetts, Minnesota, Ohio, Oregon, Rhode Island, Washington, and Wisconsin. 


Note: Table includes only challenges that at least five states reported. 
(a) Applies to only eight states. 
(b) This issue applied to only a subset of states. TQRIS were not fully implemented or were in a major transition period 
at the start of the validation study in six of the nine states. 
The most common challenges related to sample size and representativeness were 
recruiting programs for the study and attaining sufficient representation across rating 
levels or program types. Eight of the nine researchers cited each of these as challenges. 
Recruiting child care providers for the study (categorized as major by four of nine) and attaining 
sufficient representation across rating levels or provider types (categorized as major by five of 
nine) were perceived as particularly challenging. Those two challenges were categorized as 
major more frequently than any other (Table VIII.1). For example, researchers in four states 
noted that recruiting family child care providers was particularly difficult. As one researcher 
pointed out, family child care providers are typically smaller operations with fewer resources to 
offset the burden of participating in a study. Family child care providers and the families they 
serve also tend to be less stable and harder to reach than center-based programs. Researchers 
most often struggled to recruit a sufficient number of child care providers from each rating level 
(five of eight reported recruiting sample challenges). This challenge typically resulted from 
having very few child care providers at certain rating levels, most often the highest levels. 


The most common challenges related to measuring child development outcomes and 
collecting data were selecting measures and having missing data for programs or children. 
Seven of the nine researchers cited each of these as challenges. Researchers for six validation 
studies reported challenges selecting child development measures, noting the difficulty in 
striking a balance between avoiding an excessively lengthy battery of tests and measuring a 
broad mix of outcomes. On average across the eight states that collected child development data, 
a state administered six distinct child assessments (each of which might have included multiple 
subscales). The number of child assessments administered by each state ranged from 2 to 12. 
Two of the six researchers also encountered challenges sourcing a test battery available in 
multiple languages. Researchers also struggled with missing data; two researchers specifically 
cited missing values or variables in administrative data as problematic, whereas four researchers 
faced challenges with missing data for families and children, due to both attrition and low 
response rates on parent surveys. 


In six states where the validation study began before or concurrent with full 
implementation of TQRIS, researchers viewed this timing for conducting the validation 
study as problematic. When TQRIS were not fully implemented at the time of the study, 
researchers experienced challenges identifying eligible centers and their rating categories, given 
that ratings could be outdated. Further, providers that were confused or overburdened by system 
changes were frequently reluctant to participate and difficult to recruit. Findings based on 
systems in flux might also be difficult to interpret. 


Researchers also reported additional challenges, not specifically queried in the 
interviews, about relating the ratings to children’s outcomes. First, researchers for two 
studies noted that research questions related to children’s outcomes, which are often distal from 
child care program quality, were unrealistic given the timeline for the validation studies. They 
emphasized the need for a more longitudinal perspective when examining the relationship 
between the rating levels and kindergarten readiness or other outcomes. Two other researchers 
pointed to characteristics of the system that might result in ratings not accurately reflecting 
program quality. One noted that programs can choose to stay at the same rating level if there is 
not sufficient incentive to achieve a higher rating. Another researcher pointed out that the ratings 
can mask variation among programs: at a particular level, programs can exceed at least some of 
the standards for their rating. 
IX. DISCUSSION 


This synthesis of validation studies conducted by nine RTT-ELC states provides some 
evidence that ratings of TQRIS capture differences in program quality. Compared with lower- 
rated programs, programs with higher rating levels had significantly higher scores on 
independently collected measures of program quality in nearly all states. However, differences in 
program quality reflected in the ratings did not translate into differences in children’s outcomes. 
This finding is largely consistent with the findings of previous validation studies, the findings of 
the simulation conducted by Sabol and colleagues (2013), and a recent synthesis of state studies 
conducted by Tout et al. (2017). 


There are several likely reasons children in higher-rated programs did not consistently 
perform better than children in lower-rated programs. First, the overall levels of quality among 
programs, especially among higher-rated programs in TQRIS, were not high by publishers’ 
standards. Second, the differences in program quality might not have been large enough to 
produce significant differences in children’s outcomes. Third, rating levels might not align 
specifically enough to practices that would influence children’s development. The quality of 
classroom interactions and instructional practices is typically only one of numerous individual 
components upon which the ratings are based. Furthermore, TQRIS are structured to promote 
participation and quality improvement by giving programs flexibility in how they obtain ratings. 
Some of this flexibility comes in the form of points or hybrid systems or in the availability of 
automatic rating options. These characteristics might contribute to the lack of a relationship 
because the rating levels capture quality in a broad and inconsistent way. 


The program quality and child development findings underscore a challenge in 
implementing TQRIS to meet multiple goals. States design their systems to draw attention to 
multiple dimensions of quality that are important (Kirby et al. 2015; Zellman and Perlman 2008). 
However, some dimensions, such as the quality of program administration, are not as closely 
related to children’s experiences as what happens in the classroom. How much this matters for 
the ultimate goals of TQRIS may depend on how the ratings are used. Ratings that are not 
associated with children’s outcomes could still help states identify and support programs that are 
low-performing on other important dimensions. Yet, if the key objective of TQRIS is to help 
states and parents identify programs that improve children’s developmental outcomes, states may 
want to consider this objective when making revisions to their TQRIS design. 


As TQRIS become more fully implemented, it will be necessary to conduct additional 
studies that examine the validity of ratings. This synthesis documented that researchers faced 
challenges in designing and executing studies to answer the questions of interest related to the 
relationships between the ratings and program quality and children’s outcomes given the 
implementation status of TQRIS and the time frames in which they conducted these studies. 
When TQRIS are first rolled out, it can be important for validation studies to assess whether the 
measures and ratings are implemented as planned, in addition to examining the relationships 
between the ratings and external quality measures or assessments of children’s developmental 
outcomes (Tout and Starr 2013). 


As TQRIS continue to mature and include more programs, the goals of future validation 
studies could change. As Zellman and Fiene (2012) have pointed out, validation should be an 
ongoing effort that not only sheds light on whether ratings are useful, but also helps inform 
refinements to improve the system. 


Although our findings were relatively consistent across states, several limitations hamper 
our ability to draw conclusions from the synthesis of state validation studies. First, our synthesis 
is limited to only nine states with wide variability in TQRIS. Second, this synthesis is based on 
non-experimental designs that might not completely account for existing differences between 
children attending higher- and lower-rated programs. We also relied on the analyses presented by 
the researchers in each state. This meant that we could not compare individual rating levels in all 
states because not all states did so. We also did not have data for individual programs or on each 
component, which could be used to better understand why programs with different rating levels 
did not have larger differences in children’s outcomes. Along similar lines, the analyses that 
researchers presented focused primarily on preschoolers in center-based programs. Some 
researchers combined data on preschoolers with data on infants and toddlers, or combined data 
for centers with data for family child care programs, so we did not have sufficient data to look at 
whether ratings were more or less predictive based on children’s age or based on the type of 
program. Finally, the responses researchers gave in the interviews were relative to each 
researcher’s previous experience. Thus, the extent to which they viewed particular elements as 
challenging likely varied for reasons outside of the validation study itself. Additional states will 
release validation reports that will contribute to our knowledge about whether TQRIS 
differentiate between programs of differing quality and whether those differences are related to 
children’s development. 
This appendix lists all of the outcome domains and measures that states used in their 
validation studies and provides more information about measures of program quality. Table A.1 
shows the measures used in each domain. Table A.2 shows the measures used by each state. 
Table A.3 describes two common measures of program quality: (1) the Early Childhood 
Environment Rating Scale (ECERS), and (2) the Classroom Assessment Scoring System 
(CLASS). 
Table A.1. Outcome domains used for establishing validity 


Outcome domain: Cognition 
Description: Includes outcomes in the following areas: memory, problem-solving, cognitive processing 
and flexibility, and general knowledge (including school readiness and intelligence quotient [IQ]) 
Measures used in state validation reports: 
• Bracken School Readiness Assessment (BSRA) 
• Peg Tapping 
• Head-Toes-Knees-Shoulders (HTKS) 
• Pencil Tap Test 
• Mullen Scales of Early Learning (MSEL), Visual Reception 


Outcome domain: Mathematics 
Description: Includes outcomes in the following areas: basic number concepts, number operations, 
patterns and classification, measurement, geometry, and general numeracy 
Measures used in state validation reports: 
• Woodcock Johnson Tests of Achievement (WJ III), Applied Problems 
• Tools for Early Assessment in Math (TEAM) 


Outcome domain: Language development 
Description: Includes outcomes that assess the ability to understand spoken language, communicate and 
understand thoughts or ideas through speech, use developmentally appropriate discourse skills, and 
display grammatical knowledge or skill 
Measures used in state validation reports: 
• Mullen Scales of Early Learning (MSEL), Expressive Language and Receptive Language 
• Brigance Inventory of Early Development (IED), Language development 


Outcome domain: Alphabetics 
Description: Includes outcomes in the following areas: phonemic and phonological awareness, letter 
identification, print awareness, and phonics 
Measures used in state validation reports: 
• Woodcock Johnson Tests of Achievement (WJ III), Letter-Word Identification 
• Test of Preschool Early Literacy (TOPEL) 
• Early Writing Assessment (EWA) 


Outcome domain: Comprehension 
Description: Includes outcomes in the areas of vocabulary and comprehension development 
Measures used in state validation reports: 
• Woodcock Johnson Tests of Achievement (WJ III), Picture Vocabulary 
• Peabody Picture Vocabulary Test (PPVT) 
• Individual Growth and Development Indicators (IGDI) 


Outcome domain: General reading achievement 
Description: Includes outcomes that combine measures in two or more of the previous domains (for 
example, alphabetics and comprehension) or provide some other type of summary score across domains, 
such as a total reading score on a standardized reading test 
Measures used in state validation reports: 
• Story and Print Concepts 
• Brigance Inventory of Early Development (IED), Literacy 


Outcome domain: Science 
Description: Includes outcomes related to children’s content and processing skill knowledge in science 
Measures used in state validation reports: 
• Lens on Science (LENS) 


Outcome domain: Social-emotional development 
Description: Includes outcomes in the following areas: behavioral, social, and emotional competencies 
underlying school readiness, such as pro-social (or problem) behaviors, social interactions, 
cooperation, self-concept, engagement, attention, persistence, impulsivity, self-control, and initiative 
Measures used in state validation reports: 
• Devereux Early Childhood Assessment (DECA) 
• Preschool Learning Behaviors Scale (PLBS) 
• Social Competence and Behavior Evaluation (SCBE) 
• Child Behavior Checklist (CBCL) 


Outcome domain: Motor skills 
Description: Includes outcomes measuring either fine and/or gross motor skills 
Measures used in state validation reports: 
• Mullen Scales of Early Learning (MSEL), Fine Motor and Gross Motor 


Outcome domain: Program quality 
Description: Includes program-level measures of observed quality and curriculum practices 
Measures used in state validation reports: 
• Early Childhood Environment Rating Scale-Revised (ECERS-R) 
• Early Childhood Environment Rating Scale, Third Edition (ECERS-3) 
• Family Child Care Environment Rating Scale-Revised (FCCERS-R) 
• Infant/Toddler Environment Rating Scale-Revised (ITERS-R) 
• Classroom Assessment Scoring System (CLASS) 
• Early Language and Literacy Classroom Observation (ELLCO) 
• Child Home Early Language and Literacy Observation (CHELLO) 
• Arnett Caregiver Interaction Scale (CIS) 
• Preschool Program Quality Assessment (PQA) 


Sources: State validation reports for California, Delaware, Massachusetts, Minnesota, Ohio, Oregon, Rhode Island, 
Washington, and Wisconsin. 
Table A.2. Outcome measures used for establishing validity 


Outcome measure CA DE MA MN OH OR RI WA WI 

WJ III Letter-Word Identification x X Xx x x X 
TOPEL xX X 
EWA Xx 

WJ III Picture Vocabulary x 

PPVT Xx Xx X 

IGDI Xx 

Story and Print Concepts x 

Brigance (IED) Literacy Xx 

MSEL Expressive Language X 

MSEL Receptive Language Xx 

Brigance (IED) Language Xx 


development 


BSRA Xx 
Peg Tapping Xx Xx 
HTKS x x a 


x< 


Pencil Tap Test 
MSEL Visual Reception 


x< 


x< 
x< 
x< 
x< 
x< 
x< 


WJ III Applied Problems 
TEAM 


x< 


LENS xX 

DECA Xx x 

PLBS Xx Xx x 
SCBE Xx Xx x 
CBCL X 


x< 


MSEL Fine Motor 
MSEL Gross Motor 


x< 


x< 


ECERS-R x x 
ECERS-3 

FCCERS-R x 
ITERS-R xX 

CLASS xX X x 
ELLCO 

CHELLO 

CIS x x 

PQA xX 


Sources: State validation reports for California, Delaware, Massachusetts, Minnesota, Ohio, Oregon, Rhode Island, 
Washington, and Wisconsin. 


Table A.3. Common program quality outcome measures 


Program quality measure: Early Childhood Environment Rating Scale-Revised (ECERS-R) 
Description: Instrument that measures quality of classroom environment 
Domains: Space and Furnishings, Personal Care Routines, Language Reasoning, Activities, Interaction, 
Program Structure, and Parents and Staff 
Scoring: 1-2 Inadequate; 3-4 Minimal; 5-6 Good; 7 Excellent 
Evidence of predictive validity: Studies find significant positive associations between ECERS-R scores 
or subscores and children’s cognition or social outcomes (Clifford et al. 2010). 


Program quality measure: Infant/Toddler Environment Rating Scale-Revised (ITERS-R) 
Description: Instrument that measures quality of classroom environment 
Domains: Space and Furnishings, Personal Care Routines, Listening and Talking, Activities, Interaction, 
Program Structure, Parents and Staff 
Scoring: 1-2 Inadequate; 3-4 Minimal; 5-6 Good; 7 Excellent 
Evidence of predictive validity: Studies find significant positive associations between ITERS and 
cognitive development, language development, and communication skills (Burchinal et al. 1996). 


Program quality measure: Caregiver Interaction Scale (CIS) 
Description: Instrument that measures teacher-child interactions in classroom 
Domains: Sensitivity, Harshness, Detachment, Permissiveness 
Scoring: 1 Not true at all; 2 Somewhat true; 3 Quite a bit true; 4 Very much true 
Evidence of predictive validity: Studies find significant positive associations between CIS scores and 
children’s academic or social outcomes (Loeb et al. 2004). 


Sources: Harms et al. (2005), Harms et al. (2007), Harms et al. (2003), Pianta et al. (2008), Arnett (1989), Clifford et 
al. (2010), Burchinal et al. (1996), Teachstone (2017), Loeb et al. (2004). 

Note: Predictive validity is the extent to which scores on these measures of program quality are predictive of 
future scores on assessments or measures of children’s academic and socio-emotional outcomes. 
Our approach for synthesizing results across the nine states follows the What Works 
Clearinghouse (WWC) standards version 3.0 (U.S. Department of Education 2014). The WWC 
released version 4.0 of their standards in October 2017 after we had collected data from author 
queries and conducted our analyses (U.S. Department of Education 2017). This appendix 
provides additional details about the key differences between the two versions of the WWC 
standards, our approach for synthesizing results, and the supplemental analysis that compared the 
highest and lowest rating levels possible. 


Differences between versions 3.0 and 4.0 of the WWC standards 


This section describes the key differences between the two versions of the WWC standards. 
It also discusses how findings might have changed had we used the WWC standards version 4.0. 


The approach followed several key steps based on the WWC standards version 3.0. First, we 
assessed baseline equivalence for each finding related to children’s outcomes. Second, we 
included findings in the analysis if (1) baseline differences were below the WWC cutoff (0.25 
standard deviations), and (2) the study used a WWC-approved statistical method to adjust for 
any baseline differences that fell between 0.05 and 0.25 standard deviations. Under the WWC 
3.0 standards, difference-in-difference adjustments and simple gain scores were not approved 
statistical methods for adjusting for baseline differences, and imputed baseline data could not be 
used to assess baseline equivalence. 


The WWC standards version 4.0 expand the set of approved statistical methods to include 
difference-in-difference adjustments and simple gain scores and allow for more flexibility in 
using imputed baseline data, as long as the imputation uses a WWC-approved statistical method 
(U.S. Department of Education 2017). If we had used the new standards, it is possible that 
additional findings would have met baseline equivalence standards, and that we might have been 
able to include more findings related to children’s outcomes in our analysis. Given that findings 
from this report are similar to those in Tout et al. 2017 (which analyzed all findings on children’s 
outcomes, regardless of baseline equivalence), it seems unlikely that including additional 
findings would have altered this report’s conclusions. 


Study approach based on the WWC 3.0 standards 


This section explains how we assessed how similar children were on the baseline 
assessments, calculated differences between programs with high and low ratings, determined 
statistical significance, and combined findings within and across states. This approach was based 
on the WWC standards version 3.0. 


Assessing how similar children are on baseline assessments 


The most convincing evidence of a relationship between tiered quality rating and 
improvement system (TQRIS) ratings and children’s outcomes comes from analyses that 
compare children who had similar skills before attending a higher- or lower-rated program 
(baseline equivalence). To determine which of the analyses conducted by states meet this 
standard, we calculated the standardized mean difference between groups on baseline measures, 
when available. We calculated the standardized mean difference using Hedges’ g with an 
adjustment for small samples. This difference was based on the means of the child development 
measure at baseline for higher- and lower-rated programs ($y_{H0}$ and $y_{L0}$), the respective sample 
sizes ($n_H$ and $n_L$), the respective program-level standard deviations at baseline ($s_{H0}$ and $s_{L0}$), 
and the small sample size correction $\omega = [1 - 3/(4N - 9)]$, with $N$ being the total sample size. 


$$
g = \frac{\omega \left( y_{H0} - y_{L0} \right)}{\sqrt{\dfrac{(n_H - 1)\, s_{H0}^2 + (n_L - 1)\, s_{L0}^2}{n_H + n_L - 2}}} \tag{1}
$$

Based on standardized mean differences at baseline and the WWC version 3.0 baseline 
equivalence standards, we classified differences into three categories as follows: 


1. Differences meet baseline equivalence if they are between 0.00 and 0.05 standard 
deviations. 


2. Differences meet baseline equivalence if they are greater than 0.05 and less than or equal to 
0.25 standard deviations and the analysis included a statistical control for the outcome 
measured at baseline. 


3. Differences do not meet baseline equivalence if they are greater than 0.25 standard 
deviations. 
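
The three categories above can be sketched as a small Python function. This is an illustrative reading of the rule rather than the authors' code: the function name, the returned labels, and the use of the absolute value of the baseline difference (so that negative differences are treated the same as positive ones) are assumptions.

```python
def classify_baseline_difference(g_baseline, has_baseline_control):
    """Classify a baseline standardized mean difference.

    g_baseline           -- baseline difference in standard deviations
    has_baseline_control -- True if the analysis statistically controlled
                            for the outcome measured at baseline
    Returns one of three hypothetical labels.
    """
    size = abs(g_baseline)  # assumption: only the magnitude matters
    if size <= 0.05:
        return "meets"
    if size <= 0.25:
        # Differences in this range require a statistical adjustment.
        if has_baseline_control:
            return "meets_with_adjustment"
        return "does_not_meet"
    return "does_not_meet"


# Hypothetical baseline differences, not values from the state reports.
for g, controlled in [(0.03, False), (0.12, True), (0.12, False), (0.40, True)]:
    print(g, controlled, classify_baseline_difference(g, controlled))
```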


Calculating differences between programs with high and low ratings 


We calculated a standardized difference between higher- and lower-rated programs for all 
outcome measures. The standardized measure puts all of the outcome measures on the same 
metric (a standard deviation), enabling us to compare results from analyses that used different 
measures in the same domain. For example, scores on the Classroom Assessment Scoring 
System (CLASS) range from 1 to 7 and scores on the Preschool Program Quality Assessment 
(PQA) range from 1 to 5, so we cannot compare them unless we use a standardized metric. For 
each outcome measure, we calculated effect sizes using approaches developed for the WWC. 
Specifically, we used a Hedges’ g effect size, which calculates the average difference in 
outcomes between higher- and lower-rated programs based on the analytic method used by the 
states. The different types of calculations are described below. 


Standardized mean difference. For the program quality measures, we calculated the 
standardized mean difference using Hedges’ g with an adjustment for small samples. This 
difference was based on the means of the program quality measure for higher- and lower-rated 
programs ($y_H$ and $y_L$), the respective sample sizes ($n_H$ and $n_L$), the respective program-level 
standard deviations ($s_H$ and $s_L$), and the small sample size correction $\omega = [1 - 3/(4N - 9)]$, with 
$N$ being the total sample size. The standardized mean difference was defined as 


$$
g = \frac{\omega \left( y_H - y_L \right)}{\sqrt{\dfrac{(n_H - 1)\, s_H^2 + (n_L - 1)\, s_L^2}{n_H + n_L - 2}}} \tag{2}
$$
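
As a minimal sketch of the standardized mean difference described above, the following Python function computes Hedges’ g from group means, standard deviations, and sample sizes. The function and argument names are hypothetical, the small sample correction is taken to be 1 - 3/(4N - 9) (the usual WWC form), and the example scores are invented for illustration.

```python
import math


def hedges_g(mean_high, mean_low, sd_high, sd_low, n_high, n_low):
    """Hedges' g for higher- versus lower-rated programs.

    Divides the mean difference by the pooled standard deviation and
    applies the small sample size correction 1 - 3 / (4N - 9).
    """
    n_total = n_high + n_low
    correction = 1.0 - 3.0 / (4.0 * n_total - 9.0)
    pooled_variance = ((n_high - 1) * sd_high ** 2
                       + (n_low - 1) * sd_low ** 2) / (n_total - 2)
    return correction * (mean_high - mean_low) / math.sqrt(pooled_variance)


# Hypothetical program quality scores on a 1-to-7 scale:
# a half-point difference with a standard deviation of 1.0 in each group.
print(round(hedges_g(5.0, 4.5, 1.0, 1.0, 20, 20), 3))
```

Because the result is expressed in standard deviations, the same function could be applied to measures with different score ranges, which is what allows comparisons across measures within a domain.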
Effect size based on regression analysis. For child development outcomes, the estimated 
mean difference between children in higher- and lower-rated programs was often based on a 
regression analysis that statistically controlled for the baseline score. If the regression-adjusted 
means were provided, we calculated the Hedges’ g using regression-adjusted means on the child 
development measure ($y_H$ and $y_L$) and the unadjusted standard deviations. The effect size using 
regression-adjusted means was given by 


$$
g = \frac{\omega \left( y_H - y_L \right)}{\sqrt{\dfrac{(n_H - 1)\, s_H^2 + (n_L - 1)\, s_L^2}{n_H + n_L - 2}}} \tag{3}
$$

Hedges’ g can also be calculated using information from hierarchical linear models, which were 
often used in the validation studies to account for the nesting of children within programs. If the 
level-two coefficient from this model ($\gamma$) was provided, the effect size was calculated as 

$$g = \frac{\omega\,\gamma}{\sqrt{\dfrac{(n_h - 1)s_h^2 + (n_l - 1)s_l^2}{n_h + n_l - 2}}} \tag{4}$$


Effect size using a difference-in-differences adjustment. For child development 
outcomes, if regression-adjusted means or coefficients were not available, we made a 
difference-in-differences adjustment to Hedges’ g. Specifically, we computed the numerator as the 
difference between the baseline-to-follow-up change in means for children attending higher-rated 
programs and the corresponding change for children attending lower-rated programs. Defining 
$y_{h0}$ and $y_{l0}$ as the unadjusted baseline means, the difference-in-differences effect size was defined as 

$$g = \frac{\omega\,[(y_h - y_{h0}) - (y_l - y_{l0})]}{\sqrt{\dfrac{(n_h - 1)s_h^2 + (n_l - 1)s_l^2}{n_h + n_l - 2}}} \tag{5}$$
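
The difference-in-differences adjustment can be sketched in the same way. The example below is a minimal illustration that assumes the denominator uses the unadjusted follow-up standard deviations, as in the other effect size calculations; all names and numbers are hypothetical.

```python
import math


def did_effect_size(post_h, pre_h, post_l, pre_l, sd_h, sd_l, n_h, n_l):
    """Difference-in-differences version of Hedges' g (illustrative sketch).

    post_*/pre_*: unadjusted follow-up and baseline means for the higher-rated (h)
    and lower-rated (l) groups; sd_*: unadjusted follow-up standard deviations;
    n_*: group sample sizes. All names are hypothetical.
    """
    # Numerator: change for the higher-rated group minus change for the lower-rated group.
    gain_difference = (post_h - pre_h) - (post_l - pre_l)
    pooled_var = ((n_h - 1) * sd_h ** 2 + (n_l - 1) * sd_l ** 2) / (n_h + n_l - 2)
    # Small sample size correction: omega = 1 - 3 / (4N - 9).
    omega = 1 - 3 / (4 * (n_h + n_l) - 9)
    return omega * gain_difference / math.sqrt(pooled_var)
```

With made-up means that rise from 100 to 105 in the higher-rated group and from 99 to 103 in the lower-rated group, the numerator is 1 point; with a pooled standard deviation of 15, the effect size is a little under 0.07 standard deviations.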


Gain scores. We approached the use of gain scores in validation study analyses based on the 
WWC version 3.0 guidance. First, using a gain score in an analysis does not ensure that groups 
are equivalent at baseline. Such analyses must still demonstrate baseline equivalence using 
baseline means and standard deviations. Second, using a gain score as the dependent variable in 
a model does not account for the correlation between scores on the baseline and follow-up 
assessments. If baseline differences required a statistical adjustment, the analysis had to include 
the baseline score separately as a covariate. Lastly, effect sizes computed using means and 
standard deviations of gain scores for children attending higher- and lower-rated programs are 
not comparable with the effect sizes described earlier because the metric differs. We calculated 
effect sizes based on gain score means only if the standard deviations of unadjusted follow-up 
scores were also provided. 




Imputation. We approached the use of imputation based on the WWC version 3.0 guidance. 
First, baseline equivalence cannot be demonstrated using imputed data. Second, researchers may 
impute missing data for covariates, but not for outcomes. When the analysis used imputed 
assessment scores, we requested information using only cases with nonmissing assessment 
scores and used those data for assessing baseline equivalence and calculating effect sizes. 


Determining statistical significance 


For each difference in program quality and child development outcomes, we calculated the 
corresponding statistical significance level and corrected for multiple comparisons when states 
analyzed multiple measures within a domain. We used the Benjamini-Hochberg method to make 
this correction. 
This correction lowers the critical p-value (from 0.05) for individual comparisons based on the 
rank order of the p-value for that comparison and the total number of comparisons. Because of 
this correction, some effect sizes with p-values that are less than 0.05 might not be significant 
(because they are compared with a more conservative threshold). See The WWC Procedures and 
Standards Handbook, version 3.0, for a full description of the correction. 
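
The rank-based thresholds can be illustrated with a minimal sketch of the standard Benjamini-Hochberg step-up procedure. This is not the code used for the synthesis; the function name is hypothetical, and the handbook cited above remains the authoritative description of how the correction was applied.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one True/False flag per p-value (illustrative sketch).

    The p-value with rank i (1 = smallest) is compared with i * alpha / M,
    where M is the number of comparisons in the domain.
    """
    m = len(p_values)
    # Indices of the p-values ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value falls at or below its critical value.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Every finding ranked at or below the cutoff is flagged as significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant
```

For three hypothetical p-values of 0.02, 0.04, and 0.30, the critical values are about 0.017, 0.033, and 0.05, so none of the findings is flagged even though two p-values are below 0.05.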


Combining findings 


We followed WWC procedures (version 3.0) for combining findings within states. If there 
was more than one measure in the program quality domain for a state, we calculated the average 
difference between programs with high and low ratings as the average of the differences. For 
child development outcomes, we calculated the difference between children in higher- and 
lower-rated programs in the same way, by taking the average of the differences for the measures 
that met baseline equivalence standards. For each domain within a state, we calculated the 
corresponding statistical significance level using the average sample sizes in the higher- and 
lower-rated groups. 


To combine findings across states within a domain, we used a fixed-effect meta-analysis 
approach. In this approach, the average difference between programs with higher and lower 
ratings and the corresponding statistical significance level are both estimated by applying a 
weight to each state equal to the inverse of its within-state variance. We implemented the fixed- 
effect meta-analysis approach for child development outcomes in which there were findings from 
two or more states. For each fixed-effect meta-analysis, we also conducted a Q test for 
heterogeneity in the findings across states; the Q statistic follows a chi-square distribution. 
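
A minimal sketch of inverse-variance weighting and the Q statistic is shown below. The function and argument names are hypothetical, the inputs are assumed to be one effect size and one within-state variance per state, and the two-sided z-test for the pooled estimate is an assumption of this sketch rather than a procedure documented in this appendix.

```python
import math


def fixed_effect_meta(effects, variances):
    """Fixed-effect meta-analysis of state-level effect sizes (illustrative sketch).

    Returns (pooled effect, standard error, two-sided p-value, Q, degrees of freedom).
    """
    # Each state's weight is the inverse of its within-state variance.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * g for w, g in zip(weights, effects)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    # Assumed significance test: two-sided z-test based on the normal distribution.
    z = pooled / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Q statistic: weighted sum of squared deviations from the pooled effect,
    # compared with a chi-square distribution with (number of states - 1) degrees of freedom.
    q = sum(w * (g - pooled) ** 2 for w, g in zip(weights, effects))
    return pooled, se, p_value, q, len(effects) - 1
```

With two hypothetical states that have effect sizes of 0.1 and 0.3 and equal variances of 0.01, the pooled effect is the simple average (0.2) and Q equals 2.0 on 1 degree of freedom.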


Supplemental analysis of the highest and lowest rating levels possible 


This section presents additional information on the supplemental analysis that compared the 
highest and lowest rating levels possible. Table B.1 shows the groups of ratings analyzed and 
Table B.2 presents the findings from this analysis. 
Table B.1. Contrasts for supplemental analysis of highest and lowest 
possible ratings that could be examined in each state 


                                                             Domains used in average of
                Number of       Intervention  Comparison     findings that meet baseline
State           rating levels   condition     condition      equivalence standard
California      5               Tier 5        Tier 2         Alphabetics
                                                             General reading achievement
                                                             Mathematics
Massachusetts   4               Level 3ᵃ      Level 1        Alphabetics
                                                             Mathematics
                                                             Social-emotional development
Rhode Island    5               5-star        1-star         Social-emotional development


Source: State validation reports for California, Delaware, Massachusetts, Ohio, Rhode Island, and Washington. 


Note: The largest contrast in each state was created using data provided by the authors. Minnesota and 
Wisconsin did not report disaggregated data by rating level and we did not request that the authors perform 
additional analyses. 


ᵃ There were only six Level 4 programs in the Massachusetts sample, so the study excluded that level from all 
statistical analyses. 


Table B.2. Findings from the supplemental analysis of highest and lowest 
possible ratings that could be examined in each state 


                               Number of   Meta-analytic effect size for findings that meet
Domain                         states      baseline equivalence standard (p-value)
Alphabetics                    3           0.01 (0.88)
General reading achievement    2           -0.12 (0.33)
Mathematics                    3           0.09 (0.34)
Social-emotional development   3           0.03 (0.70)


Source: Author calculations. 


Note: Positive results for effect size favor the intervention group; negative results favor the comparison group. 
The effect size is a standardized measure of the effect of an intervention on student outcomes, 
representing the change (measured in standard deviations) in an average child’s outcome that can be 
expected if the student receives the intervention. 
This appendix contains a detailed description of each state’s validation study and its 
findings. Some of the study information provided was reported by only a subset of states. 


California TQRIS validation study 


Setting: The study includes center-based programs in the state of California. (The report 
presents separate supplemental analyses for a small sample of family child care providers.) 


Participants: 166 programs, 33 of which are 2-star-rated programs, and 133 of which are 
3-, 4-, or 5-star-rated programs; 1,552 children in the sample. The sample included 35 percent of 
the 472 fully rated programs participating in the California TQRIS at the time of the study. 


Program quality outcomes were collected from a subset of classrooms within programs. 
Child development outcomes were collected from a sample of children in a subset of classrooms 
within programs. 


Sample characteristics: Sample characteristics are not provided for the full sample of 1,552 
children. For those included in an analysis of the association between teacher participation in 
quality improvement activities and child outcomes (1,075 children), females made up 49 percent 
of the sample and males made up 51 percent of the sample. Nine percent had special needs. 
Sixty-four percent spoke exclusively Spanish at home or both Spanish and English, whereas 
fewer than one-third spoke exclusively English at home. Sixty-three percent were assessed as 
proficient in English. 


Authors’ statistical approach: Analysis of variance (ANOVA) was used to compare 
whether average program quality outcomes differed significantly by individual rating level. The 
authors used hierarchical linear modeling (HLM) regressions of child outcomes on individual 
rating levels, controlling for baseline scores and child and family characteristics. 


The study also examined associations between the program quality measures used in the 
TQRIS ratings and child outcomes. 


Synthesis contrast: The intervention condition was attending a program rated 3, 4, or 5 
stars. The comparison condition was attending a program rated 2 stars. 


APPENDIX C SYNTHESIS OF TQRIS VALIDATION STUDIES 


Table C.1. California findings for child development outcomes 


                                                                                                         Statistically          Used in domain average of
                                                                                                         adjusted effect size   findings that meet baseline
Domain                        Outcome measure                    Sample                     Sample size  (p-value)              equivalence standard?ᵃ
Alphabetics                   WJ III Letter-Word Identification  Preschoolers               N = 1,524    -0.08 (0.59)           X
General reading achievement   Story and Print Concepts           Preschoolers               N = 1,552    -0.18 (0.23)           X
Cognition                     Peg Tapping                        Preschoolers               N = 1,552    0.22 (0.15)
Mathematics                   WJ III Applied Problems            Preschoolers               N = 1,499    0.14 (0.36)            X


Source: State validation report for California. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 

ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05 
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a 
difference-in-differences effect size were both calculated, the statistically adjusted effect size is used in the average. 

* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
WJ III = Woodcock Johnson Tests of Achievement. 


Table C.2. California findings for program quality outcomes 


                                                                            Standardized mean
Domain             Outcome measure                 Sample     Sample size   difference (p-value)
Program quality    CLASS Classroom Organization    Centers    N = 166       0.27 (0.17)
Program quality    CLASS Emotional Support         Centers    N = 166       2.40* (0.00)
Program quality    CLASS Instructional Support     Centers    N = 166       0.90* (0.00)


Source: State validation report for California. 


Note: Positive results for standardized mean difference favor the intervention group; negative results favor the 
comparison group. 


* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
CLASS = Classroom Assessment Scoring System. 
Delaware Stars for Early Success validation study 


Setting: The sample includes centers or school-based providers of infant and toddler care, as 
well as preschool and family child care providers. All providers are located in Delaware. 


Participants: 156 programs, 32 of which are Starting with Stars or Star 2-rated programs, 
and 124 of which are Star 3-, Star 4-, or Star 5-rated programs. 1,012 children in the sample. The 
study used weights to generalize study findings to the full population of programs participating 
in Delaware Stars at the time of the study. 


Program quality outcomes were collected from a subset of classrooms within programs. 
Child development outcomes were collected from a sample of children within programs. 


Sample characteristics: Children in the sample attended early care programs, were not yet 
in kindergarten, and were 5 years of age or younger. The authors did not have birthdates for the 
children but reported the distribution of children in the sample by cohort. Children ranging in age 
from 2 to 3 years (born September 1, 2011, to August 31, 2012) and eligible to enter 
kindergarten in fall 2017 accounted for 22 percent of the sample. Children ranging in age from 3 
to 4 years (born September 1, 2010, to August 31, 2011) and eligible to enter kindergarten in fall 
2016 accounted for 38 percent of the sample. Children ranging in age from 4 to 5 years (born 
September 1, 2009, to August 31, 2010) and eligible to enter kindergarten in fall 2015 comprised 
40 percent of the sample. The child sample was nearly one-half white and non-Hispanic, one- 
quarter African American, and 15 percent Hispanic. One-third of assessed children were from 
families earning less than $25,000 per year; the child sample had a proportionately larger share 
of low-income children relative to the statewide population. 


Authors’ statistical approach: Comparison of regression-adjusted program quality 
outcome means separately by rating level. Comparison of child development outcome means that 
were regression-adjusted for baseline scores and child and family characteristics, separately by 
rating category. 


The study also examined associations between program quality outcomes and child 
development outcomes for all children and for low-income children. Low-income children were 
defined as those whose families had incomes of $25,000 or less or those receiving subsidized 
care. 


Synthesis contrast: The intervention condition was attending a program rated Star 3, Star 4, 
or Star 5. The comparison condition was attending a program rated Starting with Stars or Star 2. 




Table C.3. Delaware findings for child development outcomes 


                                                                                                         Statistically          Used in domain average of
                                                                                                         adjusted effect size   findings that meet baseline
Domain                        Outcome measure                    Sample                     Sample size  (p-value)              equivalence standard?ᵃ
Alphabetics                   WJ III Letter-Word Identification  Preschoolers and toddlers  N = 1,012    -0.09 (0.32)
Comprehension                 PPVT                               Preschoolers and toddlers  N = 926      -0.06 (0.53)
Cognition                     HTKS                               Preschoolers and toddlers  N = 741      0.25* (0.02)           X
Mathematics                   WJ III Applied Problems            Preschoolers and toddlers  N = 933      -0.16 (0.08)
Social-emotional development  DECA Absence of Behavior Problems  Preschoolers and toddlers  N = 776      0.13 (0.22)            X
Social-emotional development  DECA Total Protective Factors      Preschoolers and toddlers  N = 917      -0.11 (0.23)           X


Source: State validation report for Delaware. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05 
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a 
difference-in-differences effect size were both calculated, the statistically adjusted effect size is used in the average. 

* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
WJ III = Woodcock Johnson Tests of Achievement. 
PPVT = Peabody Picture Vocabulary Test. 
HTKS = Head-Toes-Knees-Shoulders. 
DECA = Devereux Early Childhood Assessment. 




Table C.4. Delaware findings for child development outcomes for low-income 
children 


                                                                                                         Statistically          Used in domain average of
                                                                 Sample of low-income                    adjusted effect size   findings that meet baseline
Domain                        Outcome measure                    children                   Sample size  (p-value)              equivalence standard?ᵃ
Alphabetics                   WJ III Letter-Word Identification  Preschoolers and toddlers  N = 491      -0.14 (0.25)
Comprehension                 PPVT                               Preschoolers and toddlers  N = 436      -0.16 (0.18)
Cognition                     HTKS                               Preschoolers and toddlers  N = 349      0.09 (0.51)
Mathematics                   WJ III Applied Problems            Preschoolers and toddlers  N = 417      -0.15 (0.22)
Social-emotional development  DECA Absence of Behavior Problems  Preschoolers and toddlers  N = 393      0.11 (0.40)
Social-emotional development  DECA Total Protective Factors      Preschoolers and toddlers  N = 439      -0.24 (0.05)


Source: State validation report for Delaware. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05 
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a 
difference-in-differences effect size were both calculated, the statistically adjusted effect size is used in the average. 

WJ III = Woodcock Johnson Tests of Achievement. 
PPVT = Peabody Picture Vocabulary Test. 
HTKS = Head-Toes-Knees-Shoulders. 
DECA = Devereux Early Childhood Assessment. 
Table C.5. Delaware findings for program quality outcomes 


                                                                                                   Standardized mean
Domain            Outcome measure                                    Sample             Sample size  difference (p-value)
Program quality   CIS                                                Centers and homes  N = 155      0.47* (0.02)
Program quality   CLASS Classroom Organization (pre-K)               Centers and homes  N = 156      0.63* (0.00)
Program quality   CLASS Emotional Support (pre-K)                    Centers and homes  N = 156      0.67* (0.00)
Program quality   CLASS Instructional Support (pre-K)                Centers and homes  N = 156      0.31 (0.12)
Program quality   CLASS Emotional and Behavioral Support (toddler)   Centers and homes  N = 108      0.48* (0.03)
Program quality   CLASS Engaged Support for Learning (toddler)       Centers and homes  N = 108      0.52* (0.02)
Program quality   PQA                                                Centers and homes  N = 149      0.87* (0.00)


Source: State validation report for Delaware. 


Note: Positive results for standardized mean difference favor the intervention group; negative results favor the 
comparison group. Delaware also calculated group means controlling for a variety of provider and 
neighborhood-level characteristics. The standardized mean differences and p-values using these 
regression-adjusted means are: CIS: 0.06 (0.75); CLASS Classroom organization (pre-K): 0.24 (0.22); 
CLASS emotional support (pre-K): 0.02 (0.92); CLASS instructional support (pre-K): -0.19 (0.33); CLASS 
emotional and behavioral support (toddler): 0.58* (0.01); CLASS engaged support for learning (toddler): 
0.39 (0.08); PQA: 0.68* (0.00). 

* Finding is statistically significant after adjusting for multiple comparisons if necessary. 

CIS = Arnett Caregiver Interaction Scale. 

CLASS = Classroom Assessment Scoring System. 


PQA = Preschool Program Quality Assessment. 
Massachusetts TQRIS validation study 


Setting: The study includes center-based programs in Massachusetts exclusively and does 
not include family-based, school-based, or after-school or out-of-school-time programs. 


Participants: 120 programs, 79 of which are Level 1- or Level 2-rated programs, and 41 of 
which are Level 3-rated programs. 402 children in the sample. 


Program quality outcomes were collected from a subset of classrooms within each program. 
Child development outcomes were collected from a sample of children within each program. 


Sample characteristics: Sample characteristics are provided for the total sample of 
participating children (462), which is larger than the analysis sample. Among the total sample, 
22 percent are English language learners, 13 percent are classified as special education, and 55 
percent are from families receiving tuition subsidies. 


Authors’ statistical approach: ANOVA to compare whether average program quality 
outcomes differed significantly by individual rating level. HLM regressions of children’s 
outcomes on individual rating levels, controlling for baseline scores and child and family 
characteristics. 


Synthesis contrast: The intervention condition was attending a program rated Level 3. The 
comparison condition was attending a program rated Level 1 or Level 2. 




Table C.6. Massachusetts findings for child development outcomes 


                                                                                                         Statistically          Used in domain average of
                                                                                                         adjusted effect size   findings that meet baseline
Domain                        Outcome measure                    Sample                     Sample size  (p-value)              equivalence standard?ᵃ
Alphabetics                   WJ III Letter-Word Identification  Preschoolers               N = 402      -0.05 (0.65)           X
Comprehension                 PPVT                               Preschoolers               N = 402      0.10 (0.35)            X
Mathematics                   WJ III Applied Problems            Preschoolers               N = 402      -0.11 (0.31)           X
Social-emotional development  DECA                               Preschoolers               N = 397      0.05 (0.62)            X
Social-emotional development  PLBS                               Preschoolers               N = 389      -0.18 (0.10)


Source: State validation report for Massachusetts. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05 
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a 
difference-in-differences effect size were both calculated, the statistically adjusted effect size is used in the average. 

ᵇ Effect size is calculated as a difference-in-differences adjustment to Hedges’ g. A statistically adjusted effect size 
was not calculated because regression-adjusted means or coefficients were not available. 

WJ III = Woodcock Johnson Tests of Achievement. 
PPVT = Peabody Picture Vocabulary Test. 
DECA = Devereux Early Childhood Assessment. 
PLBS = Preschool Learning Behaviors Scale. 


Table C.7. Massachusetts findings for program quality outcomes 


                                                                                   Standardized mean
Domain            Outcome measure   Sample                           Sample size   difference (p-value)
Program quality   CIS               Centers (infants and toddlers)   N = 73        0.41 (0.16)
Program quality   CIS               Centers (preschoolers)           N = 120       0.45* (0.02)
Program quality   ECERS-R           Centers                          N = 120       0.97* (0.00)
Program quality   ITERS-R           Centers                          N = 73        0.57 (0.05)


Source: State validation report for Massachusetts. 


Notes: Positive results for standardized mean difference favor the intervention group; negative results favor the 
comparison group. 


* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
CIS = Arnett Caregiver Interaction Scale. 

ECERS-R = Early Childhood Environment Rating Scale-Revised. 

ITERS-R = Infant/Toddler Environment Rating Scale-Revised. 
Minnesota Parent Aware validation study 


Setting: The study includes early care and education programs across Minnesota. Center- 
based care programs and family child care programs are included. 


Participants: 294 programs, 66 of which are 1- or 2-star-rated programs, and 228 of which 
are 3- or 4-star-rated programs. From 872 to 913 children are in the analysis sample, depending 
on outcome. The sample included 13 percent of the 2,247 programs participating in Parent 
Aware at the time of the study. 


Program quality outcomes were collected from a subset of classrooms within programs. 
Child development outcomes were collected from a sample of children in a selected classroom 
within each program. 


Sample characteristics: For the sample of children in participating programs with 
nonmissing data (568 to 1,181 children depending on the characteristic), the average age in fall 
was 4.22 years and the average age in spring was 4.65 years. Females made up 49 percent of the 
sample and males made up 51 percent of the sample. Almost two-thirds (64 percent) were white, 
15 percent were African American, 4 percent were Asian, 4 percent were Hispanic, with the 
remainder in other categories or missing. Children from low-income households made up 62 
percent of the sample, 35 percent were from high-income households, and 3 percent were 
missing data on this variable. 


Authors’ statistical approach: Comparison of unadjusted program quality means for 
higher- and lower-rated programs. HLM regression of fall-to-spring gains in child development 
assessment scores on higher-rated program indicator and child and family characteristics. 


The study also examined associations between program quality outcomes and child 
development outcomes for all children and for low-income children. Low-income children were 
defined as those from families with incomes at or below 185 percent of the federal poverty level. 


Synthesis contrast: The intervention condition was attending a program with a 3- or 4-star 
rating. The comparison condition was attending a program with a 1- or 2-star rating. 
Table C.8. Minnesota findings for child development outcomes 


                                                                                                         Statistically          Used in domain average of
                                                                                                         adjusted effect size   findings that meet baseline
Domain                        Outcome measure                    Sample                     Sample size  (p-value)              equivalence standard?ᵃ
Alphabetics                   TOPEL Phonological Awareness       Preschoolers               N = 872      0.17 (0.07)
Alphabetics                   TOPEL Print Awareness              Preschoolers               N = 898      0.06 (0.49)ᵇ
Comprehension                 IGDI Picture Naming                Preschoolers               N = 913      0.01 (0.89)
Cognition                     Peg Tapping                        Preschoolers               N = 899      0.07 (0.46)ᵇ
Mathematics                   WJ III Applied Problems            Preschoolers               N = 891      0.02 (0.83)ᵇ
Social-emotional development  PLBS Attention-Persistence         Preschoolers               N = 887      0.15 (0.10)ᵇ
Social-emotional development  SCBE Anger-Aggression              Preschoolers               N = 908      -0.08 (0.39)ᵇ
Social-emotional development  SCBE Anxiety-Withdrawal            Preschoolers               N = 907      -0.09 (0.34)
Social-emotional development  SCBE Social Competency             Preschoolers               N = 903      0.29* (0.00)ᵇ


Source: State validation report for Minnesota. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05 
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a 
difference-in-differences effect size were both calculated, the statistically adjusted effect size is used in the average. 

ᵇ Effect size is calculated as a difference-in-differences adjustment to Hedges’ g. A statistically adjusted effect size 
was not calculated because regression-adjusted means or coefficients were not available. 

* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
TOPEL = Test of Preschool Early Literacy. 
IGDI = Individual Growth and Development Indicators. 
WJ III = Woodcock Johnson Tests of Achievement. 
PLBS = Preschool Learning Behaviors Scale. 
SCBE = Social Competence and Behavior Evaluation. 
Table C.9. Minnesota findings for child development outcomes for low-income 
children 


                                                                                                         Statistically          Used in domain average of
                                                                 Sample of low-income                    adjusted effect size   findings that meet baseline
Domain                        Outcome measure                    children                   Sample size  (p-value)              equivalence standard?ᵃ
Alphabetics                   TOPEL Phonological Awareness       Preschoolers               N = 512      0.29 (0.10)ᵇ
Alphabetics                   TOPEL Print Awareness              Preschoolers               N = 533      0.16 (0.33)ᵇ
Comprehension                 IGDI Picture Naming                Preschoolers               N = 545      0.12 (0.48)ᵇ
Cognition                     Peg Tapping                        Preschoolers               N = 534      0.07 (0.68)ᵇ
Mathematics                   WJ III Applied Problems            Preschoolers               N = 526      0.09 (0.61)ᵇ
Social-emotional development  PLBS Attention-Persistence         Preschoolers               N = 534      0.05 (0.74)ᵇ
Social-emotional development  SCBE Anger-Aggression              Preschoolers               N = 549      -0.06 (0.72)
Social-emotional development  SCBE Anxiety-Withdrawal            Preschoolers               N = 548      -0.22 (0.19)
Social-emotional development  SCBE Social Competency             Preschoolers               N = 546      0.39 (0.02)ᵇ


Source: State validation report for Minnesota. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05 
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a 
difference-in-differences effect size were both calculated, the statistically adjusted effect size is used in the average. 

ᵇ Effect size is calculated as a difference-in-differences adjustment to Hedges’ g. A statistically adjusted effect size 
was not calculated because regression-adjusted means or coefficients were not available. 

TOPEL = Test of Preschool Early Literacy. 
IGDI = Individual Growth and Development Indicators. 
WJ III = Woodcock Johnson Tests of Achievement. 
PLBS = Preschool Learning Behaviors Scale. 
SCBE = Social Competence and Behavior Evaluation. 
Table C.10. Minnesota findings for program quality outcomes 


                                                                            Standardized mean
Domain            Outcome measure                 Sample     Sample size    difference (p-value)
Program quality   CLASS Classroom Organization    Centers    N = 261        0.02 (0.91)
Program quality   CLASS Emotional Support         Centers    N = 261        0.07 (0.65)
Program quality   CLASS Instructional Support     Centers    N = 261        0.07 (0.62)
Program quality   ECERS-E Literacy                Centers    N = 145        0.70* (0.00)
Program quality   ECERS-E Literacy                Homes      N = 57         -0.17 (0.53)
Program quality   ECERS-E Mathematics             Centers    N = 145        0.57* (0.00)
Program quality   ECERS-E Mathematics             Homes      N = 57         -0.09 (0.75)
Program quality   ECERS-R                         Centers    N = 146        0.59* (0.00)
Program quality   FCCERS-R                        Homes      N = 57         0.06 (0.82)


Source: State validation report for Minnesota. 


Note: Positive results for standardized mean difference favor the intervention group; negative results favor the 
comparison group. 


* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
CLASS = Classroom Assessment Scoring System. 

ECERS-E = Early Childhood Environment Rating Scale-Revised (Extension). 
ECERS-R = Early Childhood Environment Rating Scale-Revised. 

FCCERS-R = Family Child Care Environment Rating Scale-Revised. 
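

A standardized mean difference of the kind reported in the program quality tables can be computed from group means, standard deviations, and sample sizes. The sketch below assumes the Hedges' g form (pooled standard deviation with a small-sample correction), which the notes to the child development tables mention; whether the program quality tables used exactly this formula is an assumption, and the input values are hypothetical.

```python
import math


def hedges_g(mean_t: float, sd_t: float, n_t: int,
             mean_c: float, sd_c: float, n_c: int) -> float:
    """Standardized mean difference with Hedges' small-sample correction.

    The *_t arguments describe the intervention (higher-rated) group and
    the *_c arguments describe the comparison (lower-rated) group.
    """
    # Pooled within-group variance, weighted by degrees of freedom.
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    d = (mean_t - mean_c) / math.sqrt(pooled_var)
    # Small-sample correction factor.
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)
    return d * correction


# Hypothetical observation scores, not taken from any state report.
print(round(hedges_g(5.0, 1.0, 50, 4.5, 1.0, 50), 3))
```

A positive value favors the intervention group and a negative value favors the comparison group, matching the sign convention in the table notes.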


Ohio Step Up To Quality (SUTQ) validation study


Setting: The study includes state-registered early childhood sites in Ohio that had 
preschool-age enrollment. The sample includes private child care centers, home care providers, 
and elementary schools. 


Participants: 72 programs, 25 of which are 1- or 2-star-rated programs, and 47 of which are
3-, 4-, or 5-star-rated programs. From 289 to 325 children in the analysis sample, depending on
outcome.


Program quality outcomes were collected from a subset of classrooms within programs. 
Child development outcomes were collected from all eligible children within programs. 


Sample characteristics: Sample characteristics are not provided.


Authors’ statistical approach: ANOVA to compare whether average program quality
outcomes and average child development outcomes differed significantly by individual rating
level and by higher- and lower-rated grouping.
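

To make the ANOVA comparison concrete, the following pure-Python sketch computes a one-way ANOVA F statistic for groups of scores, such as outcomes for lower-rated and higher-rated programs. It is a minimal illustration rather than the authors' code: the function name and the scores are hypothetical, and the p-value is omitted because it requires the F distribution from a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups: list) -> tuple:
    """Return (F statistic, between-group df, within-group df)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups (rating levels or groupings)
    n = len(all_values)      # total number of observations

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical quality scores for lower- and higher-rated programs.
lower_rated = [1.0, 2.0, 3.0]
higher_rated = [2.0, 3.0, 4.0]
print(one_way_anova_f([lower_rated, higher_rated]))
```

The same function accepts one list per individual rating level, which corresponds to the comparison by individual rating level rather than by higher- and lower-rated grouping.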


The study also examined associations between program quality outcomes and child 
development outcomes but did not report any findings. 


Synthesis contrast: The intervention condition was attending a program with a 3-, 4-, or 5- 
star rating. The comparison condition was attending a program with a 1- or 2-star rating. 


Table C.11. Ohio findings for child development outcomes


                                                                                            Statistically adjusted   Used in domain average of findings that
Domain                        Outcome measure                    Sample        Sample size  effect size (p-value)    meet baseline equivalence standard?
General reading achievement   Brigance IED Literacy              Preschoolers  N = 289      -0.03 (0.83)             X
Language development          Brigance IED Language development  Preschoolers  N = 325      0.29* (0.02)


Source: State validation report for Ohio. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a difference-in- 
differences effect size were calculated, the statistically adjusted effect size will be used in the average. 

ᵇ Effect size is calculated as a difference-in-differences adjustment to Hedges’ g. A statistically adjusted effect size
was not calculated because regression-adjusted means or coefficients were not available. 

* Finding is statistically significant after adjusting for multiple comparisons if necessary. 


IED = Inventory of Early Development. 
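

The difference-in-differences adjustment to Hedges' g mentioned in the notes above can be sketched as follows. The sketch assumes a common convention in which the difference in pre-to-post gains between groups is divided by the pooled posttest standard deviation and multiplied by the small-sample correction; the source does not give the exact formula, so this form, the function name, and the input values are all assumptions.

```python
import math


def did_hedges_g(pre_t: float, post_t: float, sd_post_t: float, n_t: int,
                 pre_c: float, post_c: float, sd_post_c: float, n_c: int) -> float:
    """Difference-in-differences effect size in standard deviation units."""
    # Gain in the intervention group minus gain in the comparison group.
    did = (post_t - pre_t) - (post_c - pre_c)
    # Pooled posttest standard deviation (assumed denominator).
    pooled_sd = math.sqrt(
        ((n_t - 1) * sd_post_t ** 2 + (n_c - 1) * sd_post_c ** 2) / (n_t + n_c - 2)
    )
    # Small-sample correction used in Hedges' g.
    omega = 1 - 3 / (4 * (n_t + n_c) - 9)
    return omega * did / pooled_sd


# Hypothetical group means and standard deviations.
print(round(did_hedges_g(10.0, 15.0, 2.0, 30, 10.0, 14.0, 2.0, 30), 3))
```

This form uses only group-level means and standard deviations, which is why it remains available when regression-adjusted means or coefficients are not reported.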


Table C.12. Ohio findings for program quality outcomes


Domain           Outcome measure                                   Sample   Sample size  Standardized mean difference (p-value)
Program quality  CHELLO                                            Homes    N = 16       1.34 (0.02)
Program quality  CIS                                               Homes    N = 16       0.66 (0.22)
Program quality  CLASS Classroom Organization (pre-K)              Centers  N = 82       0.52 (0.03)
Program quality  CLASS Emotional Support (pre-K)                   Centers  N = 82       0.63 (0.01)
Program quality  CLASS Emotional and Behavioral Support (toddler)  Centers  N = 45       0.10 (0.73)
Program quality  CLASS Engaged Support for Learning (toddler)      Centers  N = 45       -0.24 (0.43)
Program quality  CLASS Instructional Support (pre-K)               Centers  N = 82       0.48 (0.05)
Program quality  CLASS Responsive Caregiving (Infant)              Centers  N = 36       0.54 (0.12)
Program quality  ECERS-3                                           Centers  N = 81       0.44 (0.07)
Program quality  ELLCO General Classroom Environment               Centers  N = 82       0.42 (0.08)
Program quality  ELLCO Language and Literacy                       Centers  N = 82       0.42 (0.09)
Program quality  FCCERS-R                                          Homes    N = 10       1.96 (0.01)
Program quality  ITERS-R                                           Centers  N = 80       0.60 (0.02)


Source: State validation report for Ohio. 


Note: Positive results for standardized mean difference favor the intervention group; negative results favor the 
comparison group. 


CHELLO = Child Home Early Language and Literacy Observation. 
CIS = Arnett Caregiver Interaction Scale. 

CLASS = Classroom Assessment Scoring System. 

ECERS-3 = Early Childhood Environment Rating Scale-3.

ELLCO = Early Language and Literacy Classroom Observation. 
FCCERS-R = Family Child Care Environment Rating Scale-Revised. 
ITERS-R = Infant/Toddler Environment Rating Scale-Revised. 


Oregon TQRIS validation study


Setting: The study includes regulated early care and education programs across Oregon, 
including registered family, certified family, and certified center programs. The programs in the 
sample served toddlers and preschoolers ages 15 to 60 months. 


Participants: 304 programs, 149 of which are Level 1- or Level 2-rated programs, and 155 
of which are Level 3-, 4-, or 5-rated programs. The sample included 85 percent of programs fully 
participating in the Oregon TQRIS rating process at the time of the study. 


Sample characteristics: Twenty-one percent of the sample programs were registered family 
child care programs, 30 percent were certified family, and 49 percent were certified centers. 


Program quality outcomes were collected from a subset of classrooms within programs. 


Authors’ statistical approach: ANOVA to compare whether average program quality 
outcomes differed significantly by higher and lower rating category. 


Synthesis contrast: The intervention condition was attending a program with a Level 3, 
Level 4, or Level 5 rating. The comparison condition was attending a program with a Level 1 or
Level 2 rating. 

Table C.13. Oregon findings for program quality outcomes


Domain           Outcome measure               Sample             Sample size  Standardized mean difference (p-value)
Program quality  CLASS Classroom Organization  Centers and homes  N = 259      0.42* (0.00)
Program quality  CLASS Emotional Support       Centers and homes  N = 304      0.26* (0.02)
Program quality  CLASS Instructional Support   Centers and homes  N = 304      0.44* (0.00)


Source: State validation report for Oregon. 


Note: Positive results for standardized mean difference favor the intervention group; negative results favor the 
comparison group. 


* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
CLASS = Classroom Assessment Scoring System. 
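

The table notes flag findings that remain statistically significant after adjusting for multiple comparisons but do not name the adjustment. One widely used option is the Benjamini-Hochberg procedure; the sketch below assumes that procedure and a 0.05 threshold purely for illustration, and the function name is hypothetical. The sample input reuses the rounded p-values printed in the Oregon table above, so it is only approximate.

```python
def benjamini_hochberg(p_values: list, alpha: float = 0.05) -> list:
    """Return one boolean per p-value: True if significant after adjustment."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value is at or below rank / m * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Every finding ranked at or below the cutoff is flagged as significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Rounded p-values as printed for the three Oregon CLASS findings.
print(benjamini_hochberg([0.00, 0.02, 0.00]))
```

Under these assumptions all three Oregon findings remain significant, which is consistent with the asterisks in the table, although the synthesis may have applied a different adjustment.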




Rhode Island BrightStars validation study 


Setting: The study includes center-based early care and education programs that were rated 
under BrightStars in Rhode Island. 


Participants: 71 programs, 42 of which are 1- or 2-star-rated programs, and 29 of which are 
3-, 4-, or 5-star-rated programs. 299 to 331 children in the analysis sample, depending on 
outcome. The sample included 29 percent of the 242 programs participating in BrightStars at the 
time of the study. 


Program quality outcomes were collected from a selected classroom within each program. 
Child development outcomes were collected from a sample of children within a selected 
classroom within each program. 


Sample characteristics: For the sample of children participating in the study, the average 
age was 4 years and 4 months, with a range from 36.5 months to 63.7 months. One-quarter of 
participating families received a subsidy to attend child care. Forty-one percent had an annual 
household income of $85,000 or more, and more than half of the responding parents from 
participating families had a bachelor’s degree or higher. 


Authors’ statistical approach: Regression of program quality outcomes on an ordinal star 
rating variable. HLM regression of child development outcomes on an ordinal star rating 
variable, controlling for baseline scores and child and family characteristics. HLM regression of 
child development outcomes on an ordinal star rating and the ordinal star rating interacted with a 
measure of low-income, controlling for baseline scores and child and family characteristics. 
Low-income children were defined as those from families with incomes at or below 185 percent 
of the federal poverty level. 
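

The low-income definition and the interaction term used in the third model can be illustrated with a short sketch. The poverty guideline depends on year and household size and is not given in the source, so it is passed in as a parameter; the function names and the dollar figure in the example are hypothetical.

```python
def is_low_income(household_income: float, poverty_guideline: float,
                  threshold_pct: float = 185.0) -> bool:
    """True if income is at or below the given percent of the poverty level."""
    return household_income <= poverty_guideline * threshold_pct / 100.0


def interaction_terms(star_rating: int, low_income: bool) -> tuple:
    """Build (rating, low-income indicator, rating x low-income) regressors."""
    indicator = 1 if low_income else 0
    return star_rating, indicator, star_rating * indicator


# Hypothetical guideline of $20,000 for the family's household size.
flag = is_low_income(36_000, 20_000)
print(flag, interaction_terms(4, flag))
```

In a model of this shape, the coefficient on the interaction term describes how the association between star rating and the outcome differs for low-income children.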


Synthesis contrast: The intervention condition was attending a program with a 3-, 4-, or 5- 
star rating. The comparison condition was attending a program with a 1- or 2-star rating. 


Table C.14. Rhode Island findings for child development outcomes


                                                                                            Statistically adjusted   Used in domain average of findings that
Domain                        Outcome measure                    Sample        Sample size  effect size (p-value)    meet baseline equivalence standard?
Alphabetics                   WJ III Letter-Word Identification  Preschoolers  N = 331      0.02 (0.88)ᵇ
Alphabetics                   WJ III Picture Vocabulary          Preschoolers  N = 331      -0.16 (0.17)             X
Cognition                     Pencil Tap Task                    Preschoolers  N = 320      0.08 (0.49)ᵇ             X
Mathematics                   WJ III Applied Problems            Preschoolers  N = 327      0.09 (0.41)ᵇ             X
Social-emotional development  PLBS                               Preschoolers  N = 306      -0.11 (0.35)ᵇ
Social-emotional development  SCBE Anger-Aggression              Preschoolers  N = 299      0.18 (0.12)ᵇ
Social-emotional development  SCBE Anxiety-Withdrawal            Preschoolers  N = 301      0.08 (0.48)ᵇ             X
Social-emotional development  SCBE Social Competence             Preschoolers  N = 300      0.13 (0.27)ᵇ             X


Source: State validation report for Rhode Island. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a difference-in- 
differences effect size were calculated, the statistically adjusted effect size will be used in the average. 


ᵇ Effect size is calculated as a difference-in-differences adjustment to Hedges’ g. A statistically adjusted effect size
was not calculated because regression-adjusted means or coefficients were not available. 


WJ III = Woodcock Johnson Tests of Achievement. 
PLBS = Preschool Learning Behaviors Scale. 
SCBE = Social Competence and Behavior Evaluation. 


Table C.15. Rhode Island findings for child development outcomes for low-income children


                                                                 Sample of                         Statistically adjusted   Used in domain average of findings that
Domain                        Outcome measure                    low-income children  Sample size  effect size (p-value)    meet baseline equivalence standard?
Alphabetics                   WJ III Letter-Word Identification  Preschoolers         N = 100      0.22 (0.28)ᵇ
Alphabetics                   WJ III Picture Vocabulary          Preschoolers         N = 100      -0.09 (0.67)
Cognition                     Pencil Tap Task                    Preschoolers         N = 98       -0.02 (0.92)ᵇ
Mathematics                   WJ III Applied Problems            Preschoolers         N = 98       0.10 (0.64)ᵇ
Social-emotional development  PLBS                               Preschoolers         N = 91       -0.17 (0.43)
Social-emotional development  SCBE Anger-Aggression              Preschoolers         N = 90       0.23 (0.28)ᵇ
Social-emotional development  SCBE Anxiety-Withdrawal            Preschoolers         N = 91       0.23 (0.29)ᵇ
Social-emotional development  SCBE Social Competence             Preschoolers         N = 91       0.35 (0.11)ᵇ


Source: State validation report for Rhode Island.


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the
comparison group. The effect size is a standardized measure of the effect of an intervention on student
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that
can be expected if the student receives the intervention.


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a difference-in- 
differences effect size were calculated, the statistically adjusted effect size will be used in the average. 


ᵇ Effect size is calculated as a difference-in-differences adjustment to Hedges’ g. A statistically adjusted effect size
was not calculated because regression-adjusted means or coefficients were not available. 


WJ III = Woodcock Johnson Tests of Achievement.


PLBS = Preschool Learning Behaviors Scale. 
SCBE = Social Competence and Behavior Evaluation. 


Table C.16. Rhode Island findings for program quality outcomes


Domain           Outcome measure                                   Sample   Sample size  Standardized mean difference (p-value)
Program quality  CLASS Classroom Organization (pre-K)              Centers  N = 71       0.40 (0.10)
Program quality  CLASS Emotional Support (pre-K)                   Centers  N = 71       0.53* (0.03)
Program quality  CLASS Emotional and Behavioral Support (toddler)  Centers  N = 32       1.25* (0.00)
Program quality  CLASS Engaged Support for Learning (toddler)      Centers  N = 51       1.43* (0.00)
Program quality  CLASS Instructional Support (pre-K)               Centers  N = 52       0.65* (0.02)


Source: State validation report for Rhode Island. 

Note: Positive results for standardized mean difference favor the intervention group; negative results favor the 
comparison group. 

* Finding is statistically significant after adjusting for multiple comparisons if necessary. 

CLASS = Classroom Assessment Scoring System. 


Washington Early Achievers validation study


Setting: The study includes early childhood education programs enrolled in Early Achievers 
in Washington. The setting included programs for infants, toddlers, and preschoolers in child 
care centers, family child care programs, Head Start, and Early Childhood Education and 
Assistance Program sites. 


Participants: 100 programs. 100 to 412 children in the analysis sample, depending on 
outcome. Across the children enrolled in the 100 programs, 19 percent were enrolled in
Level 4-rated programs, 59 percent were enrolled in Level 3-rated programs, and 11 percent 
were enrolled in Level 2-rated programs. (The remaining children were enrolled in unrated 
programs and were not included in the synthesis analysis samples.) The sample included 4 
percent of the 2,303 programs enrolled in Early Achievers at the time of the study. 


Program quality outcomes were collected from a subset of classrooms within programs. 
Child development outcomes were collected from a sample of children within programs. 


Sample characteristics: For the sample of children participating in the study, about half 
were boys and most spoke English (84 percent). About one-third of the sample were infants or 
toddlers and two-thirds were preschoolers. About 60 percent were white and 26 percent were 
another race (14 percent missing). The sample was 14 percent Latino and 72 percent other race 
(14 percent missing). Twenty-three percent of families reported receiving subsidies. Thirty-five 
percent of children had parents with at least a bachelor’s degree.


Authors’ statistical approach: HLM regression of child development outcomes on rating 
level, controlling for baseline scores and child and family characteristics. 


The study also examined associations between program quality outcomes and child 
development outcomes. 


Synthesis contrast: The intervention condition was attending a program with a Level 3 or 
Level 4 rating. The comparison condition was attending a program with a Level 2 rating. 


Table C.17. Washington findings for child development outcomes


                                                                                                    Statistically adjusted   Used in domain average of findings that
Domain                        Outcome measure                    Sample                Sample size  effect size (p-value)    meet baseline equivalence standard?
Alphabetics                   EWA Name                           Preschoolers          N = 412      -0.06 (0.70)             X
Alphabetics                   EWA Words                          Preschoolers          N = 405      0.01 (0.97)              X
Alphabetics                   WJ III Letter-Word Identification  Preschoolers          N = 409      0.06 (0.68)              X
Comprehension                 PPVT                               Preschoolers          N = 397      0.35* (0.03)             X
Language development          MSEL Expressive Language           Infants and toddlers  N = 155      0.29 (0.18)              X
Language development          MSEL Receptive Language            Infants and toddlers  N = 174      0.32 (0.10)              X
Cognition                     HTKS                               Preschoolers          N = 396      0.15 (0.32)              X
Cognition                     MSEL Visual Reception              Infants and toddlers  N = 166      0.29 (0.15)              X
Mathematics                   TEAM                               Preschoolers          N = 403      0.20 (0.20)              X
Science                       LENS                               Preschoolers          N = 159      0.20 (0.33)              X
Social-emotional development  CBCL                               Preschoolers          N = 222      -0.07 (0.74)
Social-emotional development  CBCL                               Toddlers              N = 100      0.02 (0.94)
Motor skills                  MSEL Fine Motor                    Infants and toddlers  N = 174      0.49* (0.01)             X
Motor skills                  MSEL Gross Motor                   Infants and toddlers  N = 138      0.32 (0.14)              X

Source: State validation report for Washington.


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the
comparison group. The effect size is a standardized measure of the effect of an intervention on student
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that
can be expected if the student receives the intervention.


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a difference-in- 
differences effect size were calculated, the statistically adjusted effect size will be used in the average. 


* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
EWA = Early Writing Assessment. 

WJ III = Woodcock Johnson Tests of Achievement.

PPVT = Peabody Picture Vocabulary Test. 

MSEL = Mullen Scales of Early Learning. 

HTKS = Head-Toes-Knees-Shoulders. 

TEAM = Tools for Early Assessment in Math. 

LENS = Lens on Science. 

CBCL = Child Behavior Checklist. 


Wisconsin YoungStar validation study


Setting: The study includes both family and group child care providers participating in the 
YoungStar program in Wisconsin. 


Participants: 239 classrooms in 155 programs. Of the 239 classrooms, 108 were rated at the 
2 Star level, 102 at the 3 Star level, 7 at the 4 Star level, and 22 at the 5 Star level. 603 to 725 
children in the analysis sample, depending on outcome. 


Program quality outcomes were collected from a subset of classrooms within programs. 
Child development outcomes were collected from a sample of children within programs. 


Sample characteristics: For the sample of children participating in the study, most children 
were either white (80 percent) or black (16 percent). About 62 percent of children resided in two- 
parent households, and slightly fewer than half had a parent with at least a four-year college 
degree. On average, families reported incomes of $78,787, and a little more than one-
quarter of families received child care subsidies. Nearly all children (98 percent) spoke English
at home. 


Authors’ statistical approach: Comparison of regression-adjusted program quality 
outcome means between higher- and lower-rated groups and among individual rating levels. 
HLM regression of child development outcomes on rating-level group (higher contrasted with 
lower and individual rating levels contrasted with one another), controlling for baseline scores, 
child and family characteristics, and provider type and region. 


The study also examined associations between program quality outcomes and child 
development outcomes. 


Synthesis contrast: The intervention condition was attending a program with a 3-, 4-, or 5- 
star rating. The comparison condition was attending a program with a 2-star rating. 
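

The synthesis contrasts described for the states in this part of the appendix can be summarized as a small lookup that assigns a program to the intervention or comparison condition from its rating level. The rating groupings below come from the synthesis contrast paragraphs for Ohio, Oregon, Rhode Island, Washington, and Wisconsin; the dictionary and function names are hypothetical.

```python
from typing import Optional

# State -> (rating levels in the intervention condition,
#           rating levels in the comparison condition).
SYNTHESIS_CONTRASTS = {
    "Ohio": ({3, 4, 5}, {1, 2}),
    "Oregon": ({3, 4, 5}, {1, 2}),
    "Rhode Island": ({3, 4, 5}, {1, 2}),
    "Washington": ({3, 4}, {2}),
    "Wisconsin": ({3, 4, 5}, {2}),
}


def assign_condition(state: str, rating: int) -> Optional[str]:
    """Return 'intervention', 'comparison', or None if the level is not in the contrast."""
    intervention, comparison = SYNTHESIS_CONTRASTS[state]
    if rating in intervention:
        return "intervention"
    if rating in comparison:
        return "comparison"
    return None


print(assign_condition("Wisconsin", 2))
print(assign_condition("Washington", 4))
```

Rating levels that fall in neither set for a state return None, which mirrors the fact that such programs are outside that state's synthesis contrast.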


Table C.18. Wisconsin findings for child development outcomes


                                                                                            Statistically adjusted   Used in domain average of findings that
Domain                        Outcome measure                    Sample        Sample size  effect size (p-value)    meet baseline equivalence standard?
Alphabetics                   TOPEL                              Preschoolers  N = 639      0.01 (0.85)              X
Alphabetics                   WJ III Letter-Word Identification  Preschoolers  N = 725      -0.10 (0.20)             X
Cognition                     BSRA                               Preschoolers  N = 725      -0.05 (0.53)             X
Cognition                     HTKS                               Preschoolers  N = 725      -0.05 (0.48)             X
Mathematics                   WJ III Applied Problems            Preschoolers  N = 725      -0.08 (0.31)             X
Social-emotional development  PLBS                               Preschoolers  N = 604      -0.02 (0.84)
Social-emotional development  SCBE Anger-Aggression              Preschoolers  N = 603      -0.03 (0.68)
Social-emotional development  SCBE Anxiety-Withdrawal            Preschoolers  N = 603      0.06 (0.44)ᵇ
Social-emotional development  SCBE Social Competence             Preschoolers  N = 603      0.07 (0.40)ᵇ


Source: State validation report for Wisconsin. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a difference-in- 
differences effect size were calculated, the statistically adjusted effect size will be used in the average. 


ᵇ Effect size is calculated as a difference-in-differences adjustment to Hedges’ g. A statistically adjusted effect size
was not calculated because regression-adjusted means or coefficients were not available. 


TOPEL = Test of Preschool Early Literacy. 

WJ III = Woodcock Johnson Tests of Achievement.
BSRA = Bracken School Readiness Assessment. 
HTKS = Head-Toes-Knees-Shoulders. 

PLBS = Preschool Learning Behaviors Scale. 

SCBE = Social Competence and Behavior Evaluation. 


Table C.19. Wisconsin findings for child development outcomes for low-income children


                                                                 Sample of                         Statistically adjusted   Used in domain average of findings that
Domain                        Outcome measure                    low-income children  Sample size  effect size (p-value)    meet baseline equivalence standard?
Alphabetics                   TOPEL                              Preschoolers         N = 119      0.08 (0.69)ᵇ
Alphabetics                   WJ III Letter-Word Identification  Preschoolers         N = 129      -0.03 (0.87)ᵇ
Cognition                     BSRA                               Preschoolers         N = 129      -0.14 (0.43)ᵇ
Cognition                     HTKS                               Preschoolers         N = 128      -0.11 (0.56)ᵇ
Mathematics                   WJ III Applied Problems            Preschoolers         N = 129      -0.09 (0.61)
Social-emotional development  PLBS                               Preschoolers         N = 99       -0.03 (0.90)ᵇ
Social-emotional development  SCBE Anger-Aggression              Preschoolers         N = 98       0.01 (0.96)ᵇ
Social-emotional development  SCBE Anxiety-Withdrawal            Preschoolers         N = 98       -0.07 (0.74)ᵇ
Social-emotional development  SCBE Social Competence             Preschoolers         N = 98       -0.38 (0.07)


Source: State validation report for Wisconsin. 


Note: Positive results for mean difference and effect size favor the intervention group; negative results favor the 
comparison group. The effect size is a standardized measure of the effect of an intervention on student 
outcomes, representing the change (measured in standard deviations) in an average child’s outcome that 
can be expected if the student receives the intervention. 


ᵃ Findings meet the baseline equivalence standard (1) if the baseline difference between groups is less than 0.05
standard deviations or (2) if the baseline difference between groups is 0.05 to 0.25 standard deviations and a 
statistical adjustment for the baseline score was made. When a statistically adjusted effect size and a difference-in- 
differences effect size were calculated, the statistically adjusted effect size will be used in the average. 


ᵇ Effect size is calculated as a difference-in-differences adjustment to Hedges’ g. A statistically adjusted effect size
was not calculated because regression-adjusted means or coefficients were not available. 


TOPEL = Test of Preschool Early Literacy. 

WJ III = Woodcock Johnson Tests of Achievement.
BSRA = Bracken School Readiness Assessment. 
HTKS = Head-Toes-Knees-Shoulders. 

PLBS = Preschool Learning Behaviors Scale. 

SCBE = Social Competence and Behavior Evaluation. 


Table C.20. Wisconsin findings for program quality outcomes


Domain           Outcome measure               Sample             Sample size  Standardized mean difference (p-value)
Program quality  ECERS-R/FCCERS-R              Centers and homes  N = 239      0.62* (0.00)


Source: State validation report for Wisconsin. 


Note: Positive results for standardized mean difference favor the intervention group; negative results favor the 
comparison group. Wisconsin also calculated group means controlling for region. The standardized mean 
difference (and p-value) using the regression-adjusted means is 0.64* (0.00). 


* Finding is statistically significant after adjusting for multiple comparisons if necessary. 
ECERS-R = Early Childhood Environment Rating Scale-Revised. 
FCCERS-R = Family Child Care Environment Rating Scale-Revised. 


APPENDIX D SYNTHESIS OF TQRIS VALIDATION STUDIES


This appendix provides complete findings on all of the challenges that we discussed in the 
interviews with researchers who conducted the validation studies. 


Table D.1. Detailed findings about challenges 


Challenge    Number of states: major challenge    Number of states: minor challenge    Number of states: not a challenge


Deciding which rating levels to validate 1 4 4 


Design limited the interpretation of findings 3 


Recruiting programs 4 4 1 
Recruiting children 1 4 4
Low response rates from programs 5 4 
Low response rates from children 2 5 


Attaining sufficient representation across 
program types and rating levels 5 3 1 


Selecting measures of program quality 0


Administering measures of program quality 0


Analyzing measures of program quality 0
Selecting measures of child development 2
Administering measures of child development 0
Analyzing measures of child development 1
Obtaining administrative dataᵃ
Analyzing administrative dataᵃ


Missing data for programs or children


Limited data on family and child characteristics


Conducting study before TQRIS were fully
implementedᵇ 4 2


Collecting data in the allotted study time frame 1 5 


Sources: Interviews with researchers who conducted the state validation reports in California, Delaware, 
Massachusetts, Minnesota, Ohio, Oregon, Rhode Island, Washington, and Wisconsin. 


ᵃ Applies to only eight states.
ᵇ Applies to only six states.
TQRIS = tiered quality rating and improvement system. 